arXiv:1508.07888vl [astro-ph.IM] 17 Aug 2015 


Mon. Not. R. Astron. Soc. 000, 000-000 (0000) Printed 1 September 2015 (MN IATeX style file v2.2) 


EMMA: an AMR cosmological simulation code with 
radiative transfer 

Dominique Aubert^*, Nicolas Deparis^ and Pierre Ocvirk^ 

^ Observatoire Astronomique de Strasbourg, CNRS UMR 7550, Universite de Strasbourg, Strasbourg,France 


1 September 2015 


ABSTRACT 

EMMA is a cosmological simulation code aimed at investigating the reionization epoch. 
It handles simultaneously collisionless and gas dynamics, as well as radiative transfer 
physics using a moment-based description with the Ml approximation. Field quantities 
are stored and computed on an adaptive 3D mesh and the spatial resolution can be 
dynamically modified based on physically-motivated criteria. 

Physical processes can be coupled at all spatial and temporal scales. We also intro¬ 
duce a new and optional approximation to handle radiation : the light is transported 
at the resolution of the non-rehned grid and only once the dynamics have been fully 
updated, whereas thermo-chemical processes are still tracked on the rehned elements. 
Such an approximation reduces the overheads induced by the treatment of radiation 
physics. A suite of standard tests are presented and passed by EMMA, providing a vali¬ 
dation for its future use in studies of the reionization epoch. 

The code is parallel and is able to use graphics processing units (GPUs) to accel¬ 
erate hydrodynamics and radiative transfer calculations. Depending on the optimiza¬ 
tions and the compilers used to generate the CPU reference, global GPU acceleration 
factors between x3.9 and xl6.9 can be obtained. Vectorization and transfer operations 
currently prevent better GPU performances and we expect that future optimizations 
and hardware evolution will lead to greater accelerations. 
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1 INTRODUCTION 

Starting in the 70s, numerical simulations have been part of 
the astrophysicists tool kit to investigate regimes where non- 
linearities, strong coupling between different physics and 
multi-scale processes are dominant. The study of structure 
formation in a cosmological context is an archetypal ex¬ 
ample of such intricacies and has thus driven the rise and 
the growth of the so-called ’cosmological simulations’ for 
decades now. These simulations have successively included 
collisionless dynamics and hydrodynamics, and implements 
routinely sub-resolution physics such as star formation or 
feedback from supernovae to model the galaxy formation 
process. 

Recently, the challenges encountered in galaxy forma¬ 
tion theory and the near advent of new observatories capa¬ 
ble of investigating the Universe at redshifts z > 6 (such 
as JWST or SKA) produced an interest toward the study of 
the reionization epoch. Interesting in its own sake, as a great 
cosmic transition produced by the first sources and capable 
of reionizing the Universe, this epoch will be studied in-situ 
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and will provide insights on the initial stages of the struc¬ 


ture formation process (see e.g. reviews by Barkana & Loeb 
2001 Pritchard fc Loe'b||2012 ). Reionization is also thought 
to provide keys to understand the current state of galaxies, 
with the rise of a UV background capable of suppressing star 
formation in light objects (see e.g. Gnedin|2000| Hoeft et al. 


2006 Finlator et al.||2011[ Wise et al.||2012 ). Consequently, 

numerical simulations of the reionization started to appear 
15 years ago to assess this process and prepare the advent 
of observational constraints (see Trac &: Gnedin|[2011 for a 
review). 


The common feature of such simulations is the inclusion 
of radiative transfer physics, to model the impact of the UV 
photons emitted by the first sources. Quite demanding in 
terms of computing resources, radiative transfer is often in¬ 
cluded as a post-processing step on outputs of simulations 
to reduce this cost : absorbents are static and ad-hoc recipes 
are included to model the negative feedback of radiation on 


sources (see e.g 

Ciardi et al.|2003| Iliev et al.|2006 

|McQuinn 

et al.|2007||Mel 

ema et al. 

20061 IZahnet al.|2007|| 

3aek et al. 

20101 IChardin et al.|2012 

lOcvirk et al.|2013| |Paardekooper 

et al. 120131 |Ocvirk et al.| 

2014| Zawada et al.[|2014|). Never- 


theless, radiative hydrodynamics codes started to be imple- 
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mented (like recently e.g Finlator et al. TM ^ [Rosdahl et al. 


2013||Pawlik et al.||2015 ). They cope with the reduced time 

scales (and therefore increased CPU cost) produced by radi¬ 


ation physics using massive parallelism (see e.g. Trac fc Cen 
2QQ7| Norman et al.|2Qf^, methodological sol utions (such as 


implicit solvers like e.g. [Gonzalez et al.|2007| [Norman et al.| 

2015 ) or specihc physical regimes (such as a reduced speed 
of light like e.g. Gnedin|[2014| Rosdahl et al.|[20T3 ). What¬ 
ever will be the solution, radiation will surely become an 
additional standard component of future cosmological sim¬ 
ulation codes. This evolution is greatly illustrated by the 
number of participating codes to comparison projects such 
Iliev et al. ( 20Q6| and Iliev et al. (2009). 


In this context, a new cosmological simulation code is 
presented here, EMM^ that includes collisionless physics, hy¬ 
drodynamics and radiative transfer. It builds upon the ex¬ 


perience gathered with previous codes such as ATON ( Aubert 
Teyssier|2008| 2010) or a particle-mesh N-Body only code 


(Aubert et al. 2009). Its ambition is to be able to tackle 


structure formation experiments with a special emphasis on 
the influence of radiation during the reionization epoch. EMMA 
is an adaptive mesh rehnement code, parallel and with GPU- 
driven acceleration on a selection of its sub-modules, mainly 
the physics engines. As such, it shares a number of features 
with ATON and extend its held of application to coupled ra¬ 
diative hydrodynamics and to high resolution through mesh 
rehnement. 

The full methodology of EMMA is hrst described: details 
on how data is managed, how the physics is solved and 
how parallelisation is implemented are presented. Then we 
present a selection of validations tests, from pure dark mat¬ 
ter experiments to more complex situations with coupled 
physics. Finally we discuss the performances and the poten¬ 
tial future developments of EMMA. 


2 METHODOLOGY 

2.1 Adaptive Mesh implementation 


EMMA relies on a grid-based description of the physical quanti¬ 
ties and implements adaptive mesh rehnement (AMR) tech¬ 
niques to increase dynamically its spatial resolution. The lat¬ 
ter is typically of the order of a few cells, for all the diherent 
physics presented here, whether it is set by the smoothing 
of the gravitational force or the smearing of hydrodynam- 
ical shocks or ionization fronts. AMR permits to increase 
this resolution with a moderate cost in terms of memory 
consumption by providing hner meshes only at selected lo¬ 
cations. 

Several AMR implementation exists and EMMA adopts a 
Fully Threaded Tree (FTT, [Miokhlov ( I998| ) description of 
the data, sharing this core feature with other codes such as 
ART ( Kravtsov et al.|I997 ) or RAMSES ( Teyssier|20Q2 ). In this 
framework, a fundamental grid divides a 3D space (usually 
cubic but not necessarily) in 2^^^ cells where designates 
its rehnement level and where the cells are arranged in a 
Gartesian manner. For instance = 7 corresponds to a 
fundamental 128^ grid where each direction is sampled along 
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128 points. Hereafter this fundamental coarsest grid will be 
referred as the base grid and ic is the base level. 

These base level cells are the roots of oct-trees that fully 
describe the rehned geometry of space. If a cell is rehned, 
it points towards an octal structure (or oct) that in turns 
points toward eight additional cells belonging to the + I 
level. This hierarchy can be recursively maintained until a 
satisfying resolution (i.e. rehnement level) is achieved. Gon- 
versely, base level cells can also be grouped in octs, that 
belong to coarser cells of level ic — 1- Recursively, this pro¬ 
cess can be repeated until a single i = 0 cell is obtained. 
This specihc cell is therefore the root of the global AMR 
structure that samples the 3D space where the numerical 
experiments take place. 

In practice, the data is organized using octs as the fun¬ 
damental structures where a level i oct contains the fol¬ 
lowing informations: 

• the level i, 

• 8 cells cf with i G [0, 7], 

• a pointer toward its parent cell 

• 6 pointers toward the 6 Gartesian neighbors cells of 

• 2 pointers toward the next and the previous octs in 
memory that belong to the same level i. Note that the ar¬ 
rangement of octs in memory is not related to spatial distri¬ 
bution of the octs in the computational domain. With this 
feature, octs are stored as a doubly linked list 

A cell cf, contains the following informations: 

• its index z G [0, 7], 

• the physical data at this location (density, potential, 
pressure, ionized fraction, etc...), 

• a pointer toward an oct if it is rehned, 

Octs are mainly used for data exploration and man¬ 
agement. They are the mandatory intermediates in order to 
probe the neighbors of a a given cell. They are also the in¬ 
termediate structure to scan all the cells at a given i (i.e. 
at a given spatial resolution) as they store the pointers to¬ 
ward the previous and next member of the set of all the . 
This scanning operations are performed through the doubly 
linked list. 

Gells are mainly used to store physical data. They also 
store a pointer toward an oct if they are rehned but that is 
all the relational information that they possess. Any query 
on a neighbor cell must be passed through their parent oct. 
This description reduces greatly (by a factor 8 typically) the 
number of relational pointers that would be required if each 
cell possessed its own set of neighbors: such an organiza¬ 
tion takes advantage of the fact that all the cells of an oct 
share a signihcant fraction of neighbors, at the cost of some 
operations to retrieve them via the octs. 

Since geometrical relations between neighbors are ex¬ 
plicit, the mesh can be arbitrarily rehned without any con¬ 
straint on the geometrical strategy of this rehnement. Also 
there is no simple mapping between the position of an oct 
in memory and its geometrical location. As a consequence, 
the fully threaded tree can be extremely versatile and follow 
closely any physical feature that requires high resolution at 
the cost of storing additional information. 

In rehned grids, resolution jumps require special care. 
In particular these interface are source of errors and inac- 
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curacies that must be controlled as much as possible. One 
standard requirement that must be enforced in the current 
description of AMR data is the fact that resolution jumps 
cannot be greater than unity for two neighbor cells. Hence 
the six neighbor pointers of an oct necessarily exist. How¬ 
ever it also puts constraints on the way cells are refined and 
the procedure takes place as follow, in that order: 

(i) an oct containing a cell marked for rehnement or al¬ 
ready refined must be maintained to ensure that resolu¬ 
tion jumps are smaller than two. Its parent cell is therefore 
marked for refinement. 

(ii) if a cell is marked for refinement, so must be its 26 
neighbors. It ensures a smooth transition between low and 
high resolution regions by creating a buffer of high resolu¬ 
tion cells at the interface and also prevents large resolution 
jumps. 

(iii) if a cell satisfies some user-dehned criterion, it is 
marked for refinement. This criterion can be physically mo¬ 
tivated (based on physical values or gradients) or based on 
location (for zoom simulation for instance). 

In practice, step (i) is applied to all the cells that belong to 
a given level in a single pass. Afterwards the step (ii) cri¬ 
terion is likewise applied to all the cells of that level, then 
step (iii), resulting in 3 successive passes on all the cells of 
a given level. The marking procedure can be repeated an 
arbitrarily number of times, especially to increase the thick¬ 
ness of the buffer regions at the transition between regions 
at 2 different resolutions. In practice, this procedure is ap¬ 
plied twice, similarly to RAMSES or ART. It ensures that a cell 
that is refined on a physical basis (criterion 3 above) will be 
produced with an outer layer of 2 neighbor cells at the same 
resolution. Once this marking has been completed, cells can 
be rehned by creating a new oct to be attached to it or 
coarsened by suppressing its child oct. When refined, all the 
relations must be computed while the physical quantities are 
straightforwardly injected from coarse values. 


2.2 Solvers 

EMMA tracks the evolution of 3 ‘fluids’, coupled to each other. 
First a collisionless fluid, sampled by particles in a standard 
particle-in-cell description. It is aimed at modeling the dy¬ 
namics of stars and non-baryonic dark matter. Second, the 
code solves the Euler equations to describe the gas dynam¬ 
ics. We chose to follow RAMSES by using a piecewise linear 
method ’a la’ MUSCL-Hancock driven by HLLC Riemann 
solvers. Third, EMMA models the propagation of radiation us¬ 
ing a moment description of the radiative transfer (RT here¬ 
after) equation with the Ml closure relation. Preliminary 
atomic processes are also included to describe the cooling 
and ionization that takes place on top of the adiabatic evo¬ 
lution of the gas. Finally, EMMA deals with cosmological set¬ 
tings through the use of supercomoving coordinates, which 
are briefly discussed. 


2.2.1 Collisionless dynamics 

Collisionless dynamics are handled through a Monte-Carlo 


sampling of phase space using particles (see e.g. Hockney & 


Eastwood (1981) for a reference). Each particle consists in 


a structure of data that contains: 


(i) its fundamental properties: mass, 3D position r = 
X, y, z and velocity v = Ux, 

(ii) the level of the cell it belongs to, 

(iii) 2 pointer towards the next and the previous particle 
that belong to the same cell. 


Additionally, if collisionless dynamics is included, a non- 
refined cell contains a pointer toward the first particle it 
contains. If a cell is split, the particles of the coarse cell are 
passed to the newly created cells. If an oct is destroyed, all 
the particles are assigned to the parent cell. By means of 
linked lists, the code can therefore scan all the particles of 
a given cell. 

Each particle is evolved in phase space thanks to the 
usual Newton’s equations : 


dv 

dt 

= -V<^(r), 

(1) 

A<f)(r) 

= AnGpir), 

(2) 


where p(r) stands for the 3D total matter density and 0(r) 
is the associated gravitational potential. Updating the phase 
space coordinates of a particle therefore implies an evalua¬ 
tion of the density and the potential fields on the refined 
mesh. 

The density is computed in all cells of the AMR grid 
using a standard cloud-in-cell (CIC) interpolation scheme. In 
practice it is performed level by level, according to the time 
stepping procedure described in section |2.3| If an unsplit 
cell cf contains particles, they are assigned to cells of the 
same level £ according to the CIC scheme. Furthermore if a 
neighbor cell is split, the contribution of the particles in its 
-f 1 cells is also taken in account to compute the density of 
cf. Conversely, if a neighbor cell is coarser and unsplit, its 
particles will also contribute to the density of cf. In these two 
situations, the CIC extent of the particles will correspond 
to the level i resolution, resulting in a CIC density equals to 
the one that would be obtained from a uniform grid at level 
£. Finally if cf is split, its density is computed by averaging 
the 8 values of its child cells. 

The density being available, the potential is obtained 
by solving the Poisson equation. In the case of EMMA, the so¬ 
lution is obtained by means of relaxation techniques. They 
allow for a great flexibility (notably compared to FFTs based 
approaches) in dealing with complex domain geometries, ar¬ 
bitrary boundary conditions and grids with multiple resolu¬ 
tions. One can use the simple Jacobi iteration formula where 
the estimation p + 1 for the potential in the cell of Carte¬ 
sian coordinates (z,j,/c) can be computed from a previous 
estimation p via: 




Af+i + ■ 


+ 




>U + ' 


>^+ 1 +' 




AtvGAx^ p 
6 


(3) 

(4) 


The convergence rate of this process is slow, especially to 
achieve convergence on scales comparable to the total vol¬ 
ume. As can be seen in eq.|^ information can only be prop¬ 
agated from one cell to the next one, making such formula 
inefficient at determining the low frequency modes of the 
solution. It can be marginally accelerated by means of over¬ 
relaxation or by using the Gauss-Seidel iteration technique, 
analog to Jacobi but where new estimates of the potential 
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are used on the RHS of equation]^ as soon as they are 
available instead of cjf. 

Greater acceleration rates can be obtained by using a 
multigrid algorithm. The first stage is a ‘smoothing opera¬ 
tion’ where a simple relaxation formula such as eq.|^is used 
to remove spurious high frequency features in the current 
estimation of the potential, i.e. in the regime of scales it 
is the most efficient. Then this estimate of cj)^ is coarsened 
through a restriction operation and applied as a first guess 
for a new relaxation stage, which is more efficient at low res¬ 
olution for two reasons : first the number of sampling points 
is reduced (typically by a factor of 8 in 3D) and second, the 
convergence of low frequency modes is increased as their ef¬ 
fective scale is reduced relative to the influence radius of eq. 
1^ Once this low resolution solution is converged, it is in¬ 
terpolated back (also called prolongation step) and after a 
few relaxation steps at full resolution a converged solution 
is obtained. 

This two level-process (between I and ^ — 1) can be 
recursively applied to ^ — 1 and 1 — 2 and so on. In EMMA, we 
typically proceed until I = 2 (i.e. using a 4^ held) to achieve a 
10“3 convergence of the solution with only ~ 5 iterations at 
full resolution. We take advantage of the AMR structure to 
store residuals and perform the calculations on coarse levels. 
Prolongation is performed via linear interpolation, whereas 
restriction is done by averaging on the 8 cells of an oct. 

For hne levels with i > ic, the potential is computed by 
simple relaxation, using a red-black Gauss-Seidel smoother, 
with boundary conditions provided by the £ — 1 cells that 
surround high resolution patches. However, the initial po¬ 
tential evaluation is obtained by interpolation of the coarse 
solution. Therefore, convergence is quickly obtained from 
this hrst value that is already close to the correct one. For 
EMMA, convergence at the 10“^ level are obtained after 10 
iterations in a typical cosmological simulation. Fine cells at 
resolution jumps must compute the potential using high res¬ 
olution values that do not exist on the coarse side of the 
jumps. In this case, the potential is interpolated at the re¬ 
quested location from its coarse value. 

The potential being available, the acceleration held 
f = — V0 is computed by simple derivation and is interpo¬ 
lated back at the particles positions again accordingly to the 
GIG scheme. If a particle lies close to the interface between 
two regions at different levels i and £ — 1, the force held is 
interpolated from the coarsest £—1 level. As such, it implies 
that a particle must penetrate signihcantly within high res¬ 
olution regions to be sensitive to their force held, otherwise 
they will be driven by a lower resolution description of the 
potential. 

Particles are advanced thanks to a mid-point scheme. 
The p + 1 value of the position and the velocities are com¬ 
puted from their p value as: 

Tp+I = Tp + Vp+i/zAiP (5) 

where the mid-point velocity is provided by 

Vp+l/2 =Vp+fp— (6) 

and hnally corrected as: 

At^ 

Vp+l = Vp+i/2 + fp+1 (7) 


2.2.2 Hydrodynamics 


Hydrodynamics are solved through a Eulerian description 
of conserved huid quantities, which obey the set of Euler 
equations (see e.g. |T^ ( |T9W| |): 

au aF(u) aG(u) aH(u) 


dt 


dx 


dy 


dz 


= S, 


( 8 ) 


where U = (p, pu, pv, pw, E) is the array of conserved quan¬ 
tities, the density, the 3 components of momentum and the 
energy. F, G and H are the 3 hux functions with 

F(U) = {pu, pu"^ p, puv, puw,u{Ep)) 


G(U) = (pv, puv, pv^p, pvw,v{Ep)) 
H(U) = {pw, puw, pvw, pw^p,w{Ep)) 


( 9 ) 
( 10 ) 
( 11 ) 

In association with conserved quantities, this is usual to also 
consider the primitive quantities, namely the density p, the 
velocities v = (u,v,w) and the pressure p. The total energy 
is given by 

+ (12) 

with contribution of the kinetic and the internal energy. Here 
7 is the usual adiabatic exponent, equal to 5/3 for an ideal 
mono-atomic gas. The High-Mach flow regime is taken in ac¬ 


count using the recipe described in Rasera & Teyssier (2006) 


The set of Euler equation is solved in a split fashion, 
dealing first with the pure transport part with a null r.h.s in 
eq.j^then updating the solution by adding the contribution 
of source terms. The transport update of into is 

done explicitly (in ID here for simplicity) via: 


Ui 


p+i _ 


At 


— l]P A-lAL (T-P _ ttp ^ 


(13) 


This solution requires the knowledge of intercell fluxes 
^i±il 2 instant p between cells i and i ± 1/2, through 
the resolution of Riemann problems. Here we use a MUSGL 
scheme coupled to an HLLG Riemann Solver (see e.g. 


Toro 


|et al.| |l994 ); Toro (1997)). Eirst, conserved quantities are 


linearly reconstructed at the cell boundaries (in ID for sim- 
plicity) : 


uy^ = uf ±^. 


A,; 


(14) 


Here L/R designates the left and right reconstructed states 
at the cell i boundaries in x = 0 and x = Ax, the cell 
center being in Ax/2. Ai is the slope vector of the conserved 
quantities and is computed using neighbor cells values and 
a MinMod limiter to ensure a monotonic solution. Before 
solving the Riemann problem, the boundary extrapolated 
values are also evolved by a time At/2 : 


^L/R ^ ^L/R ^ ^ ) - F(Uf)] . 


2Ax 


(15) 


The Riemann problem at intercell positions are solved using 
these evolved states, for instance obtained from 

states Uf and Uf/i- The MUSGL scheme achieves second 


order accuracy ( Toro et al.||1994 Toro||199~ ). As discussed 
previously, when dealing with cells at the interface between 
2 regions with different resolutions, coarse quantities are in¬ 
terpolated in a conservative manner at the locations of the 
’virtual’ hne cells. At such interfaces, the low resolution val¬ 
ues are assumed to be constant in time and accuracy re¬ 


duces to hrst order (Khokhlov 1998 Teyssier 2002). The 
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multi-dimensionality of the problem is dealt through an un¬ 
split approach with a flux contribution of the 3 directions 
included at once in the update. 

The stability of the solution is obtained by enforcing 
the Courant condition on the time step 

= (16) 

where Vh is an estimate of the largest wave speed present 
throughout the computational domain and C < 1. We follow 
the simple suggestion of Toro with 

Vh = max{|v|i + ai} (17) 

where ai stands for the sound speed at location i. It usually 
overestimates the limiting velocity and therefore underesti¬ 
mates the corresponding time step, but provides a robust 
estimation. 

The source term in Eq. models deviation to pure 
conservation. In our case S holds the contribution of the 
gravitational force to momentum and energy variation and 
expresses the coupling of gravitation with hydrodynamics : 

S = iO,-pV<j>,-pvV<p), (18) 

where 0 is the gravitational potential. Its inclusion in the 
solution (eq. is done in a ’explicit’ manner with 

s = (0, (19) 

This contribution modifies the conservative quantities af¬ 
ter the pure transport update. However they are taken in 
account within the MUSCL scheme to compute the interpo¬ 
lated left/right states before solving the Riemann problems. 
Of course hydrodynamics is also coupled to radiative physics 
and thermo-chemistry via the internal energy (or equiva¬ 
lently the pressure) of the gas : photo-heating and cooling 
act as source and sink terms of this quantity and are treated 
by the radiative transfer engine. Finally, the hydro-engine is 
able to handle passive scalars that are being advected with 
the fluid. Among such scalars, one can currently find the hy¬ 
drogen neutral fraction or in forthcoming developments one 
can think of the metallicity. 


2.2.3 Radiative Transfer 


Propagation of radiation is dealt with using a moment based 
description : light is described as fluid, where its phase space 
distribution is averaged on velocities to focus on spatial 
fields. Among this family of radiation description, EMMA re¬ 


lies on the MI approximation ( Levermore||I984| |Gonzalez 
et al.||2007 Aubert fc Teyssier||2Q08 ). 


Taking the first two moments of the Liouville Equation 
leads to the conservation equations of the ionizing photons 
density Njp{r,t) and flux Fi,(r,t), in each bin of frequency 


dN, ^ 
dt ^ dr 
dFu 


— KpFjp. 


( 20 ) 

( 21 ) 


Here, Piy(r, t) stands for the radiative pressure tensor. This 
system of equation being opened, a closure relation is re¬ 
quired to be able to solve it. The Ml closure relation is 


given by: 

P. = (22) 

D, = (23) 

Here, stands for the Eddington tensor and its expression 
is set by the value of the y quantity that varies between 1/3 
for a diffusive regime (i.e. iT ^ cNjy) and 1 for a pure 
transport regime (i.e. Fjp ~ cNj^). 

The quantities on the r.h.s. of the conservation equa¬ 
tions are the source of photons Su{r,t) (expressed in pho¬ 
tons per unit time per unit volume) and the two absorption 
terms, driven by the absorption coefficients kn and kf, con¬ 
sidered equal in EMMA with kn = ca^nH- This terms couple 
the hydrodynamics and the radiative transfer by means of 
gas absorption with nn being the density number of neutral 
gas and the photo-ionization cross-section at frequency 


In the case of multi-frequency transfer, photons are 
gathered in so-called ’groups’ of frequencies and the flux and 
number densities of a given group satisfy the above conserva¬ 
tive equations. In practice, Eqs|20|and |21| can be integrated 
between two frequencies: 


ON OF 

dt dr 

= S — C(7NTiHNi 

(24) 

2 ^ 
dt ^ dr 

— —ccrw^rrF, 

(25) 

with 

r^2 

N = / 

Nj.dh' 

(26) 

II 

to 

Fjpdiy 

(27) 

' ■ 

II 

Sj^diy 

(28) 

II 

b 

r^2 

' auNudu. 

(29) 


Typical frequency groups have limits set by the ionization 
levels of hydrogen and helium (even though EMMA does not 
currently handle Helium chemistry) i.e. [13.6,24.6, 54.4] eV 
or chosen to represent broad classes of different types of 
radiation such as UV, X and hard X-rays. 

In practice, the update of radiative quantities is a two 
stages process : first a conservative transport is performed, 
then non-conservative contributions (source and sinks) are 
added within a subsequent thermo-chemical solver (see Sec. 


2.2.4). The set of homogeneous (with zero r.h.s) coupled 
equation is solved for U = (iV, F) using fluxes F = (F, c^P) 
with a simple explicit finite difference scheme (in ID for sake 
of simplicity): 


TJP+I _ TJP I _ TTP ^ 

that must satisfies the usual Courant Condition: 
Ax 

C ^ . 

At 


(30) 


(31) 


Here -^f_i/ 2 (U) represents the flux function at instant p 
measured at the interface between the cell i and i — 1. This 
intercell flux is obtained by solving a typical Riemann prob- 
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lem at this interface : in EMMA, this flux is given by the Lax- 
Friedrich Formula: 


•^i+l/2(U) 


Fi + Fi 


i + l 


Ui+1 - Ui 


(32) 


As in ATON, an implementation of the less diffusive HLL flux 
is also on the way. At this stage, conservative transport is 
completed and radiative quantities (N,F) are in an interme¬ 
diate state, waiting for the contribution of source and sinks 
to be taken in account, as explained in the next section. 

Before, it should be noted that the Courant condition 
ensures the stability of the scheme but impose that the nu¬ 
merical ’sampling velocity’ Ax/At must be greater than the 
speed of light. The cost of simplicity provided by the explicit 
solver is therefore a very fine temporal sampling, greatly en¬ 
hancing the CPU-cost of the radiative transfer. This strin¬ 
gent constrains on the time step imposed by the Courant 
condition can be circumvented in two ways. The first one 
is simply by taking advantage of hardware acceleration (as 
in ATON). In EMMA this route is explored with GPUs as ac¬ 
celerating devices and described in section [4| The s econd 


one is to reduce the speed of light (Gnedin & Abel 2001 
Rosdahl et al.||2013 ), taking advantage of the fact that 


large number of situations, the effective propagation of ra¬ 
diative information is performed through ionization fronts 
that propagate at smaller pace. This option can also be set 
in EMMA, as done for instance in Sec. |6.6.1| 


2 . 2.4 Thermal and chemical processes 


group of frequencies of interest: 
1 


CTe = 


N{E} 


f 


auNjyhiydiy, 


1 r 

«/ u-\ 


Nuhudu 


(37) 


(38) 


where the latter quantity (E) is the average photon energy 
in the same group. 

In the case of multi-frequency transfer, Eqs ( 33][36 ) are 
solved for each frequency interval (i.e. group) where 

i stands for a group label. Within each frequency group i the 
corresponding photon density and fluxes (W,Fi) satisfy : 

^ = Si-caN,inHNi + {aA{T)-aB{T))6i,ix\U39) 

dFi 


—rr — —caN,inHFi. 

dt 


(40) 


These equations depend on the group cross-section aN,i and 
source function Si obtained using Eqs and on the 
[lAi, lAi-^-i] interval. The number of radiative conservative up¬ 
dates therefore scales as the number of frequency groups. 
The Kronecker symbol Si^i in Eq. implies that the re¬ 
combining radiation only contributes to the first group of 
ionizing photons, the closest to the hydrogen ionization fre¬ 


quency (see e.g. Rosdahl et al. (2013)). Again, if the on-the- 
spot approximation is used, this contribution is set to zero 
for all the groups. The thermo-chemical equations Eqs. |35| 
and|36|are also modified and the following expressions must 
be replaced: 


The thermal and ’chemical’ processes encompass the atomic 
physics that will affect hydro and radiative quantities. Cur¬ 
rently EMMA only handles atomic hydrogen processes. They 
contribute to change the number density and flux of photons 
(source and sinks), the ionization state of the gas and finally 
its internal energy: 

^ = S-caNnHN + {aAiT)-aB{T))x\l, (33) 

r/F 

— = -caNUnF, (34) 

^ = iaA{T)x^-l3{T)xil-x))nl-cnHCTNN, (35) 

de 

— = cuhT^eN - A{no,x,T), (36) 

where nn = (1 — x)p/mp = (1 — x)no is the number density 
of hydrogen atoms (with individual mass rup), x being their 
ionized fraction and no being the total number of protons 
(i.e. neutral + ionized hydrogen). The quantities aA/B and 
[3 are respectively the case A/B recombination and colli- 
sional ionization rates. Eqs. [33| and [34| describe the influence 
of source and sinks on the radiative quantities : in particu¬ 
lar the last term in Eq. |33| refers to the recombining ionizing 
radiation. Note that the on-the-spot approximation can eas¬ 
ily be applied by setting a a = as in Eqs. |33| and |35| Eq. 
|35| details the competing effects of recombination and ion¬ 
izations on the number of neutral hydrogen atoms. Eq. 
encodes the evolution of the internal energy density of the 
gas e = p /(7 — 1) due to atomic cooling (given by the cool¬ 
ing rate A) and to the photo-heating above the ionization 
threshold?^ = cuhNY^e- HereEs = (cr£;(E) —crwEis.e) and 
the quantity crs is the energy averaged cross-section over the 


Ngroups 

cuhcfnN CUH ^ crN,iNi, (41) 

i=l 

Ngroups 

cuhTieN cuh ^ ^ TiE,iNi. (42) 

i=l 

Ngroups stands for the total number of frequency intervals 
considered. Eqs. and depend on the groups cross- 
sections and photons densities and effectively couple pho¬ 
tons of different frequencies that are otherwise transported 
without any interactions. 

In practice, we solve the equations in the same order as 
above by sub-cycling the dynamical step At with a chemical 
time step At. During the latter, all quantities are updated 
from state p to p + 1. At is first evaluated from the p state 
of the internal energy, with 6r = e^/ (TL^ — A^). Then all the 
quantities are updated in the following manner and in that 


order: 



jyP+l ^ 

NP + iS+ {aA{T) - aB{T))x‘^nl)AT 

(43) 

1 + ccrA((l — XTP)no 

pP+1 _ 

pp 

(44) 


1 + ccrA((l — x'P)no 

xP+^ = 

^ aA(T^)ic^^^oAT + 1 — 

1 + AT{j3{TP)xPno + CCTArA^P+i) 

(45) 

gP+1 ^ 

eP 

(46) 


+Atc( 1 - A+^)noiVP+^EB 

(47) 


-ArA(no,x*’+\T)). 

(48) 


In this sequence, new information on a given quantity is 
immediately used to compute a subsequent one. We follow 
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L+1 octs creation/destruction 


CIC 
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Advance Particles 


Hydrodynamics Solver 




Light sources creation 


Radiative transport 


Thermo-Chemistry 

I 
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Figure 1. Flow of sequence for a time step at a given level (from 
top to bottom). Side arrows describe the exchange of physical 
quantities between different modules to emphasize the most im¬ 
portant couplings. Green boxes stand for modules that have been 
ported on GPU using GUDA. It is assumed that At^ = 2Ati^i 
and the i 1 level is advanced twice. This sequence is repeated 
recursively for all the finer levels. 


Rosdahl et al. (2013) and enforce that the internal energy 


must at most vary by 10% (relatively) during At, otherwise 
the set of equation is recomputed with a time step divided 
by 2 until this condition is satisfied. In all our experiments, 
this procedure has been found to be accurate and robust 
enough. All the rates required to describe the atomic pro¬ 
cesses such as the recombination, the collisional ionization, 
and the cooling are taken from the compilation of |Theuns| 


et al. (1998). Photo-ionization cross-sections are taken from 


Hui & Gnedin 


(1997). 


2.3 Time stepping 

The organization of time step is intimately constrained by 
the multi-level structure of the data. A single level time step 
is organized in quite an usual fashion and described in Fig. 

Coupling between the different physics occur at different 
levels, the most explicit ones being : 

• between Gravity and Hydrodynamics through the total 
matter density and the gravitational force it creates, 

• between radiation and hydrodynamics through the den¬ 
sity distribution, or the gas temperature. 


t t-hdt 

Level L ^ 

Level L+1 ^ 


Level L+2 • -► • -► • -► • -► • 

^1 ^2 ^4 5 


Figure 2. The time-step hierarchy of 3 level of the AMR struc¬ 
ture. The coarse level i is advanced from t to which implies 

partial advance and temporary updates of fine levels i-\-l and i+2 
until they are synchronized with i. Arrows and numbers indicate 
the sequence of partial updates at different levels to perform this 
full update of the 3 nested levels. 


On top of these explicit coupling, detailed in the next sec¬ 
tion, implicit ones also occur where e.g. a photo- evaporating 
halo could in principle affect the underlying dark matter dis¬ 
tribution (in the same way supernovae feedback could affect 
it, even though they are not explicitly coupled, e.g. |Pontzen| 
GQvernatQ||2012 ) 

When mesh refinement is enabled, coupling between 
levels must be taken into account with special care. In EMMA, 
updates are performed level by level, each level being up¬ 
dated with its own time step At£. Typically, Ati ~ x 2 

{i corresponding to a level coarser than i 1). Advancing 
the solution on a level i can be expressed as the following 
recursive expression: 

Ai = Ri^iP^Ai^iUiMi, (49) 

where Ai stands for ’advancing the solution’ at level £. Ri 
is the refinement operator (i.e. creating/destroying ^ H- 1 
octs from i cells). Pi and Ui are the operators for the Pois¬ 
son resolution and the update (i.e. moving particles, up¬ 
dating Eulerian fields) at level i. Finally, Mi corresponds 
to the marking of the level £ cells for future refinements. 
The recursion is stopped at the maximal allowed level with 
R^max = M^max = ^£max+i = 1- For instancc, a simulation 
with three resolution levels (e.g. -£ = 5,6, 7) will be fully up¬ 
dated on Ats according to the following operations (Fig. 
details this step in a schematic manner) : 

Mb = R 5 P 5 [RePe [^ 7 ] UqMq] [RqPg [At] UqMg] 

Mr = P7U7P7U7 (50) 


where we assumed that At^ = 2Ate = dAtr. More generally, 
the time steps are constrained by : 

AtJ+i + At^+i ^ Ati. (51) 

Once the level -£ + 1 has been updated by At]^i + 
the value Ati is updated to synchronize the coarse level on 
the finer one. 

As originally described by Khokhlov (1998), this nested 
hierarchy of time step has some strong implication on the 
flux update at interfaces between two levels. When conser¬ 
vative quantities are updated on level £-\-l, adjacent £ cells 
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Figure 3. Multilevel update of fluxes at the interface between 
two levels. To perform an update from t to t + dt, flue level ^ + 1 
must perform 2 steps sub-cycled within a large step of the coarse 
level i. Updates are performed according to the 1,2,3 sequence : 
note how the middle coarse cell is updated during the 3 steps with 
flue fluxes first and a final coarse flux,during the level I update. 


must take in account the fluxes produced during the subcy¬ 
cling of level ^ + 1. For instance, if a conservative quantity 
Ug and %+i in two adjacent cells must be updated on At^, 
the following sequence must be obeyed (see also Fig. |^: 

(i) is updated to using a flux 

Meanwhile is temporarily updated using the same flux. 

(ii) is updated to using a flux 

,U^). Again is temporarily updated 

using the same flux. 

(iii) Finally, is fully updated taking in account the 

flux created by adjacent cells of level 1 . 


This partial update of coarse cells during fine time steps is 
one of the reason to enforce a maximal jump in resolution of 
unity between two adjacent cells. Larger jumps would create 
a complex hierarchy of partial update of coarse quantities, 
difficult to handle properly. 

The amplitude of time steps is set by physical temporal 
scales that must be resolved to track properly their impact 
on the evolution of quantities. EMMA being a multi-physics 
code, a whole list of time scales are computed for all the 
cells of a given level and the smallest set its global time 
step: 


• we follow Teyssier (2002), and limit the rate of change 
of the cosmological expansion factor (5a(Atcosmo)/tt < e, 

• a particle cannot move on a scale larger than the size 
of a cell : Atpic = eAa^^/'Umax, 

• the local dynamical time must be resolved : Atdyn = 
e/ ^/Gp, 


• the hydrodynamical Courant condition must be satis¬ 
fied : Athyd < IS.XilVh (see Eg. pTf 

• If light sources are present, the radiation Courant con¬ 
dition must be satished : Atrad < Ax^/c. It ensures that 
light propagation is performed in a stable manner. 


When full physics are included and effective, the most strin¬ 
gent condition is usually provided by Atrad and by orders 
of magnitude since usually Vh- This dominance can be 
reduced by setting a reduced speed of light. Furthermore as 
non-linearities increase (and even more when strong shocks 
such as induced by supernovae feedback will be included) 
the ratio Athyd/Atrad tends to decrease. 


3 COSMOLOGICAL SETTING 


Cosmological experiments are implemented using the set of 
’super-comoving’ variables suggested by [Martel fc Shapir^ 
(1998). The transformation from physical to supercomoving 
variables are given by: 


f = 

r 

ar^ ’ 

(52) 

V = 

av 

(53) 

p = 

P 

a^p^ ’ 

(54) 

p = 

a p 

(55) 

0 = 

a^cj) 

(56) 


dt 


dt = 


(57) 

N = 

a Ar* 

(58) 

F = 


(59) 


where starred quantities stand for normalization units with 
— L (the box length), = 2/(iLoV^m), 
p* = 3iLo^rn/(87rG) and p* = p^v\. Hq and a{t) stand for 
the usual current Hubble parameter and the time-dependent 
expansion factor. 

With this set of transformation, it can be shown that 
almost all the differential equations to be solved keep their 
standard expression for ay = 5/3 gas. The only notable 
exception is the Poisson equation which becomes: 

A0 = 6a(5, (60) 

where 5 — p/{p) — 1. Still, this equation remains typical of 
an elliptic equation that can be solved by all the methods 
already in place for the Newtonian held equation. Overall, 
the use of such a transformation greatly simplihes the imple¬ 
mentation of cosmological settings in this kind of simulation 
code. 


4 PARALLELIZATION AND 

VECTORIZATION 

EMMA is a parallel code which includes two levels of multi¬ 
tasking. The hrst one is the standard multi-CPU mode, 
where the computational domain is distributed among sev¬ 
eral processes that communicate with each other via the 
MPI protocol. The second level of parallelism resides within 
an MPI-process where the local load is distributed among 
several threads of execution. In the case of EMMA, this local 
parallelisation is performed on GPUs but could in principle 
be extended to other modes of multi-threading such as lo¬ 
cal shared-memory parallelism among multiple CPU cores 
or other hardware accelerators. The second level of paral¬ 
lelism can be understood as a vectorization, where arrays 
of data are processed in parallels through the same set of 
instructions with minimal communications. 

Bearing this two-levels parallelism in mind, EMMA has 
been designed to decouple as much as possible instructions 
that deal with the logistic of data from the ones that ac¬ 
tually perform calculations. Logistics operations are dehned 
as operating directly on the AMR tree : e.g. cell marking 
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and refinement, tree management and inter-process commu¬ 
nications. These operations are handled by CPUs. Comput¬ 
ing functions on the other hand expect arrays of data to 
be ’crunched’, without any mention to tree-organized data 
or inter-process parallelism and return likewise array of re¬ 
sults. The physics solvers belong to this second category 
and are meant to be processed by vector based-hardware, 
such as multi-core processors, GPUs or any other kind of 
co-processor. In between, a set of interface functions must 
be developed to perform gather/scatter operations from/to 
the AMR tree to/from the calculations arrays. These aspects 
are developed in the next subsections. 


4.1 Distributed parallelism on multiple CPUs 


Distributed parallelism is handled through a space-filling 
curve domain decomposition. Such a curve provides a ID 
mapping of a 3D grid by assigning a unique key to each oct 
as a function of its Cartesian position. The number rip of 
parallel processes being defined, the curve is split in rip suc¬ 
cessive parts with equal loads, thus assigning a set of octs 
to each process. The number rip can thus be arbitrary and 
the ID mapping alleviates the need to deal with multiple 
boundaries along multiple directions. EMMA has been imple¬ 
mented with both a Peano-Hilbert space-filling curve and 
a slab-based key ordering for problems with unidirectional 
variations (such as the Shock Tube or the Zeldovich Pan¬ 
cake) . Currently, the domain decomposition is performed at 
the level ic corresponding to the base resolution of the sim¬ 
ulation : all the £c octs are distributed among the processes, 
in such a way that each process possesses at least one such 
oct. All octs created from a level ic cell are assigned to the 
same process. At the current stage, EMMA does not perform 
any kind of load-balancing that could be obtained by sliding 
the limits of the ID domains along the space-hlling curve to 
optimize the distribution of work among the processes. 

It should be emphasized that the AMR structure can 
only be fully exploited if it remains ‘consistent’: no holes, no 
level jumps greater than 1 from one cell to another, point¬ 
ers toward neighboring cells must exist, etc... Furthermore, 
a given process should be aware, at least partially, of the 
AMR tree structure of the neighbor processes. EMMA copes 
with these issues by employing the local essential tree de¬ 
composition ( Warren Salmon^ 1993 Dubinski|1996 ) : each 
process, even though it has been assigned only a subset of 
the total volume, is aware of the whole hierarchy of nested 
octs but only at the levels relevant to its tasks (see Fig. 
1^. Neighbor cells directly in contact with its domain bring 
their whole tree from ^ = 1 to ^max, those being second or¬ 
der neighbors are one level coarser and so on. In practice, 
this local hierarchy is obtained at the initial building of the 
oct-tree: from the root i = 1 oct, cells are refined down to 
the coarse level if they belong to the sub-volume assigned 
to the current processor or if they are direct neighbors of 
this sub-volume. It produces naturally a local tree for each 
processor, individually fully consistent and yet aware of the 
structure of the direct neighbors. 

Communications are handled using the MPI-protocol 
and if a given process Pq requires data present on other pro¬ 
cesses, it must be performed explicitly by specifying which 
octs should come from which other MPI process. This com- 



Figure 4. An example of sub-domain seen by a processor in 
the essential tree paradigm. The processor has been assigned the 
lower left corner (shaded) but is aware of the whole computational 
domain. Distant regions are coarsened relative to close ones. 


munication protocol in EMMA has been written in the follow¬ 
ing fashion: 

• First, all processes Pi build their own lists of neighbor 
MPI processes {Pj} with i ^ j. 

• For each member of {Pj}, a list of requests (i.e. of neigh¬ 
bor octs) is established by Pi by storing their space-filling 
curve keys. This list of keys is sent from each Pi to all 
its {Pj} : Pi acts as a client sending requests to neighbor 
servers. 

• Likewise each Pi receives a list of requests from the 
same sources : Pi acts as a server to neighbor clients. 

• Each client key is processed by Pi through an hash table 
to relate the absolute key to a local pointer to an oct. The 
data is gathered and sent back to the clients {Pj}. 

• Meanwhile, the data from the servers {Pj} is received 
and scattered back in the local tree by Pi. 

This set of instructions is performed level by level and called 
by the Poisson, the hydrodynamics and the radiation solvers 
to update border cells that belong to other processes and 
that have been remotely modified: the flux of information 
can be considered as outside-in. However there are situation 
where the information flux is inside-out: a process performed 
locally will affect directly a value outside its domain. The 
first example of inside-out communication is the CIC assign¬ 
ment, where a local particles will contribute to an adjacent 
domain. The second example are the conservative updates 
due to hydro or radiative fluxes between cells at different 
levels and belonging to different processes. Because this up¬ 
date is asynchronous between levels, conservative values can 
be updated in a neighbor coarser cell and must be communi¬ 
cated to its home process. In this case the protocol is similar 
to the one described above except that there are no request 
stage and data is sent directly from the server to the client. 
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Flat Vector 


Figure 5. Schematics of the conversion between oct-tree data 
management by the CPU for the AMR structure and array based 
calculations, for instance on GPU. A back and forth flow of data 
is performed through gather/scatter operations. Grey levels label 
different reflnement levels both in the tree and the array. 


It should be noted that without load-balancing, the list 
of neighbors {Pj} is static for each Pi. Hence the first stage 
of the communication protocol has to be performed only 
once. However the list of neighbor octs is dynamic, because 
of mesh refinement. Therefore, for a given pair client/server 
Pi / Pj , the list of requests is changing on the fly and should 
typically be recomputed any time octs have been created or 
destroyed. 


4.2 Local vectorization 

Local parallelism relies on a vector-based strategy, where 
arrays of data are processed through the same set of in¬ 
structions and possibly on architecture with vectorization 
capability. The driver for this choice of design is the re¬ 
cent emergence of multi-core processors, graphics process¬ 
ing units or CPU-based co-processors, that all relies on this 
programming paradigm to be fully efficient. In the case of 
EMMA, this kind of parallelisation is focused on the physics 
solvers (i.e. the relaxation of the Poisson equation, the con¬ 
servative update of hydrodynamics and radiative transfer 
and the chemistry solver) which are fed with arrays of ini¬ 
tial states and evolved into arrays of updated values. 

Relying on vectors presents several pros. First it guar¬ 
antees an optimal data layout in general by ensuring that 
it is accessed in a coalesced and aligned manner : computa¬ 
tion directly on tree stored values would induce random and 
unpredictable memory accesses, whereas an array-based or¬ 
ganization ensures proximity of successive or concurrent cal¬ 
culations thus providing optimal performances. For GPUs in 
particular, enforcing this kind of memory access is a require¬ 
ment to obtain a maximal throughput of the devices. Fur¬ 
thermore, arrays are generic and simple structures of data, 
that can be processed in a general manner : each element 
of an array is computed like an other one and the imple¬ 
mentation of this single-element flow of instruction can usu¬ 
ally easily ported from one architecture to an another or 
even from one language to another. The difference usually 
arises on the details of the scanning operation on all the 
elements : they can be parsed in sequence for a scalar cal¬ 
culator or by launching multiple threads of a single element 
computation on GPU or shared memory cores or by taking 


advantage of vector abilities of languages such as Fortran 90 
or Python. Overall, vectorization provides an opportunity to 
choose easily a language or an architecture for the code com¬ 
putational modules, without any consideration on the design 
and layouts of the data structure. For instance, EMMA has 
been coded into both scalar GPU and GUDA GPU versions 
of the Poisson, hydro and radiative transfer solvers, both 
versions working in the same AMR framework. In fact, up¬ 
coming developments may lead to changes in the way AMR 
is handled and it would not impact the way physical engines 
are designed. 

However, it becomes readily apparent that the AMR 
oct-tree being a non-vector based way to store data, the 
latter must be converted back and forth from a tree-based 
organization to an array-based one (see Fig.|^. These gather 
and seatter operation are critical to the code performance as 
they constitute bottlenecks to the overall code performances. 
Nevertheless, if the amount of calculation is large enough, 
the cost of these operations can be hidden by computing 
or by overlapping transfers and calculations. In practice in 
EMMA, when data is gathered from the tree to update a value 
in a given cell, all the necessary values from the neighbors 
are gathered too. For instance, Eq. [^requests 7 values to up¬ 
date the potential of a given cell (6 from the cardinal neigh¬ 
bors and 1 for the density). Their related gather operation 
therefore organizes data in 7 arrays of Ua values, required 
to update the potential in Ua cells. Of course, since two ad¬ 
jacent cells share some neighbors, the data in these arrays 
can be redundant. Similarly, intercell fluxes (used during 
hydrodynamics and radiative transfer) are computed twice 
for two adjacent cells. In principle such overheads could be 
avoided but at the cost of coding simplicity and at the cur¬ 
rent stage the data or flux evaluation has been kept re¬ 
dundant. Gather operations are also in charge of dealing 
with resolution jumps : if a given cell requests data from a 
neighbor at an unavailable resolution, it is interpolated lin¬ 
early from coarse data at the position of the fine virtual cell. 
Global boundary conditions are also dealt by these gathering 
operations As said previously, boundary conditions are peri¬ 
odic by nature : if transmissive boundaries are required, the 
gather operation replaces the data from the periodic neigh¬ 
bor cell by the data of the current cell. If reflective bound¬ 
aries are set up, the same operation is performed with an 
additional flux inversion. 


5 COARSE RADIATIVE TRANSPORT 
APPROXIMATION (CRTA) 

Sec. [^describes the standard methodology to couple the dif¬ 
ferent physics within an AMR code with adaptive time step¬ 
ping. Physical quantities and data structures are updated 
at the pace of the fastest evolving ’’dynamics” among the 
collisionless, the hydro- and the radiative ones. This imple¬ 
mentation has been proven to be both accurate and practi¬ 
cally sustainable (in terms of required computing resources) 
for hydro-dynamical codes in the past. However the inclu¬ 
sion of explicit radiative transport and out-of equilibrium 
chemistry severely impact the code’s efficiency as it must 
track processes on time scales one or two orders of mag¬ 
nitude smaller than the pure hydrodynamical case. Hence 
an experiment covering a given physical duration must be 
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Figure 6. Comparison of the time stepping and multi-physics 
coupling in the standard AMR approach (top, described in Sec[^ 
and in the coarse radiative transport approximation (CRTA) 
description (bottom, described in Sec. [^. Thick black/ thin 
red/dashed blue arrows stand respectively for dynamics (col- 
lisionless+hydro) operations, radiative transport and thermo¬ 
chemistry. The thermo-chemistry is shown as a dashed line to 
indicate that it is already sub-cycled with respect to the radiative 
transport. In both cases, the simulation is updated by a dynam¬ 
ical time step (from left to right), taken to be equal to 4 times 
the coarse radiative time step. Note how the red arrows have the 
same length in both scheme, corresponding to a coarse radiative 
transport time step. It is also assumed that 2 additional levels of 
refinement are enabled (L+I and L+2). In the standard case, such 
an update takes 4 full updates of the AMR tree, because each up¬ 
date covers a radiative time step, resulting in a total of 83 engine 
calls. In the CRTA case, the dynamics are updated first over a 
dynamical time step (7 calls) and radiation + thermo-chemistry 
are treated in a second stage (16 calls) for a total of 23 calls. Note 
how the radiation transport is only performed at the coarse level. 


sampled with a number of operations one or two orders of 
magnitude greater, including not only physical engines but 
also any kind of logistics functions or overhead. As such, 
any deviation to a perfectly optimized set of operations can 
see its magnitude multiplied by a factor 100 and reduces 
significantly the code efficiency. 

In this section we suggest an additional level of approx¬ 
imation for the coupling between radiative processes and 
dynamics (hydro + collisionless). It can significantly reduce 
the resources necessary for a simulation with radiation, at 
the cost of a degraded (mostly spatial) resolution. It is sum¬ 
marized in Fig. and relies on two sets of additional ap¬ 


proximations compared to the standard implementation of 
Sec. [2 


(i) Radiation transport and the associated thermo¬ 
chemistry is explicitly decoupled from the dynamics 
(collisionless+hydro-). Hence to advance the simulation, dy¬ 
namical quantities are updated first on all the AMR levels 
on a timescale only constrained by the dynamics (usually set 
by the hydro- CFL condition). Then, matter is considered 
as ’’frozen” and radiation is propagated within this static 
distribution for the same duration. Since typical speeds en¬ 
countered in dynamical processes are of the order of the local 
sound speed or free-fall velocities, which are much smaller 
than c, such decoupling should remain under control and 
provide results similar to the standard procedure. Of course, 
radiation is subject to stringent CFL condition which implies 
that radiative quantities are updated through an intensive 
subcycling (typically 100-1000 cycles) of the dynamical time 
step with small radiative time scales. 

(ii) Additionally, radiative transport is only performed at 
the coarse level. However, thermochemistry is still computed 
on refined levels, but with a coarse-grained description of 
the radiation field that is simply injected from the coarse 
to the fine levels. Not only it reduces the number of trans¬ 
port operation but it also reduces significantly the number 
of thermochemistry steps : this engine is already subcycled 
even in the standard approach (see Sec. 2.2.4) and can there¬ 
fore operate on a large radiative time step. Furthermore if 
an equilibrium situation is encountered in a given cell (a 
frequent situation in fully ionized regions for instance), this 
thermo-chemistry subcycling can be reduced to a few cycles. 


Fig. [^provides a simplified comparison of the standard and 
the coarse radiative transport approximation scheme (CRTA 
hereafter). We arbitrarily chose a situation where the coarse 
dynamical time step is 4 times larger than the radiative one 
at the coarse level. In the standard description the time step 
is set by the radiative CFL condition at all levels. Hence for a 
coarse+2 refined levels situation as the one described in Fig. 
I^the number of dynamics, radiative transport and thermo¬ 
chemistry engine calls are identical and equal to 28 (4 on the 
coarse level and 24 on the refined ones) for a total of 84 calls. 
In the CRTA case it reduces to 7 dynamics calls (including 
6 calls on refined levels) + 4 radiative transport calls at the 
coarse level +12 thermo-chemistry calls for a total 23 engine 
calls, i.e. a factor of 3 smaller in the number of operations. 
Bear in mind that a realistic case rather involves a ratio of 
100 to 1000 between dynamics and radiative time steps and 
5-10 refinement levels: in such cases, the CRTA approxima¬ 
tion essentially reduces the cost of hydrodynamics to zero 
and by neglecting transport on refined levels, it reduces the 
cost of a radiative time step by a few tens. Additionally, 
AMR logistics, communications setups, analytics, etc., are 
performed only at the dynamical time step and their costs 
are also essentially set to zero in the CRTA approach. In¬ 
cidentally, this technique also increases the relative weight 
of physical engines over numerical overheads and among the 
engines it increases drastically the weight of radiative trans- 
port+chemistry over the others : it turns out that the latter 
engine is one of the most efficient in the use of vectorization 
(see Sec.[^ and therefore in the use of hardware accelerators 
such as GPUs. CRTA is expected to take a greater benefit 
of such devices to accelerate the code. 
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Of course this increased efficiency comes at a cost. The 
most evident one is the decreased spatial resolution for 
radiative transport, even though the thermo-chemistry is 
performed at the highest resolution available. It could be 
thought of as an intermediate approach between an homo¬ 
geneous radiation field (as usually assumed in non-RT cos¬ 
mological simulations) and a full AMR description. In the 
CRTA approach, spatial UV field fluctuations are existent 
but coarsened. However the impact could be limited. First, 
as shown in Aubert & Teyssier (2010), radiation fields (ra¬ 
diative density and flux) do not exhibit significant clump¬ 
ing factors compared to the ones of the matter distribution 
and are relatively smooth even in highly resolved simula¬ 
tions. In fact, the values of radiative densities span orders 
of magnitude between sources and dark voids and are there¬ 
fore not very sensitive to local fluctuations. Furthermore, we 
use here a highly diffusive scheme (based on a LF intercell 
flux evaluation) which accentuates further the smooth as¬ 
pect of radiative fields. One could therefore argue that hav¬ 
ing a fine spatial description of radiative density fields is not 
absolutely necessary. The other level of approximation is the 
imperfect coupling of radiation and matter, the latter being 
considered as still when light is being cast. Somehow, it re¬ 
lates to previous post-processing techniques but performed 
on the fly at every dynamical time step. Post-processing is 
known to provide satisfying results for large scale experi¬ 
ments on ~ 50+ Mpc scales, hence we can be confident that 
this imperfect coupling can be controlled in such cases. On 
the other hand, it is clear that some coupling between mat¬ 
ter and radiation in highly refined cells will be lost and it 
is difficult to evaluate properly this loss. As shown in Sec. 
the CRTA returns very satisfying results for the tests shown 
here but in general using CRTA would require to perform 
an additional level of convergence study to ensure that this 
imperfect coupling is under control. 


6 CODE VALIDATION 

EMMA has been submitted to a series of test to validate its im¬ 
plementation. For various combination of simulated physics, 
documented experiments are described here, as well as the 
code results. 


6.1 ID Hydrodynamics : Shock Tube 


The shock tube is a ID test where a Riemann problem is 
evolved by means of a simulation. It focuses on implementa¬ 
tion of hydrodynamics and the ability of the MUSCL scheme 
and HLLC Riemann solvers to capture shock features. The 
initial conditions consist in a jump at Xq = 0.3125 be¬ 
tween two different states (pi = l,ui = 0,pi = 1 and 
P 2 = 0.125,U 2 = 0.,p2 = 0.1, taken from Toro (1997)). 


The solution to this Riemann problem is known and can 
therefore be compared to the results delivered by EMMA. 

The calculation has been performed using a ^ = 6 coarse 
resolution with 4 additional levels of resolution, triggered by 
density gradients satisfying Ap/p + 0.015. Even though the 
problem is ID, the calculation has been performed in 3D 
with the jump occurring along the x direction. Transmissive 
boundary conditions were retained along the x direction and 
periodic ones along the two others. 


0.2 0.4 0.6 0.8 



X 

Figure 7. Shock tube experiment. From top to bottom: refine¬ 
ment level, density, velocity and pressure as a function of position. 
Points stand for the simulation results and solid lines for analytic 
profiles. 


Fig. Ill shows the density p, the velocity along the x- 
direction u, the pressure p and the refinement level at 
t = 0.2. Also shown is the solution of the Riemann prob¬ 
lem, as a solid line. Clearly, the match is satisfying with 
shocks being resolved on a few cells, thanks to both the 
shock-capturing scheme and the improved resolution allowed 
by the on-the fly refinement. It can also be noted that the 
contact wave is also present near x=0.5, even though some 
smearing can still be present at this resolution. Overall, this 
standard test demonstrates the ability of EMMA to solve clas¬ 
sic hydrodynamical problems at high resolution. 


6.2 ID Gravity+ Hydrodynamics: Zeldovich 
Pancake 


The Zeldovich Pancake test (Zel’dovich 1970)) tracks the 
evolution of a single planar mode in an = 1 expanding 
Universe, where the linear stages of the evolution can be 
analytically predicted. The initial matter density is given 
by: 


p(x) = 1 + I cos(27rx) 

^ Zi 


(61) 


whereas the initial velocity is given by: 

= \ (iV^^.^)3/2 sin(27ra:). (62) 


Here the mode oscillates along the x direction, zi and Zc 
stand for the initial and collapse redshift. 

As in Sec. \6.1\ this test case has been simulated with 
EMMA in 3D as a planar experiment. Both the baryons and 
dark matter were included with = 0.1. The base res¬ 
olution is = 6 (i.e. 64^ cells) and the dark matter field 
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Figure 8. Zeldovich Pancake experiments with baryons (Q = 
= 0.01). From top to bottom: the dark matter velocity 
the gas pressure Pgas, velocity r’cc,gas, and density Pgas,- 
All the quantities are shown as a function of the position, at z=0. 
The sine wave is initiated at z=100 and collapsed at z=10. One 
can note the increasing resolution toward the center of the caustic, 
from i = 6 to i = S. 


is sampled with 64^ particles. The initial temperature is 
chosen to be arbitrarily small at T = 10K and velocities 
orthogonal to the x-direction are taken to be zero. The ex¬ 
periment has been conducted down to z = 0 with Zi = 100 
and ;2;c = 10. Two additional refinement levels were triggered 
on gas density gradients Ap/p > 0.1. This setting provides 
a situation where the cosmological setting and the coupling 
between dark matter and baryons are tested. Linear stages 
were compared to the analytic solution and were found to 
match at better than the % level (not shown here) until the 
redshift of collapse. 

Fig.j^shows the z=0 baryon density, velocity and pres¬ 
sure as well as the dark matter phase diagram. Clearly, be¬ 
ing way later than the collapse redshift (zc = 10), it can be 
noted that several ’plane-crossing’ occurred with a signifi¬ 
cant number of foldings for the dark matter (DM hereafter) 
phase space diagram. Baryons fell in the DM potential, cre¬ 
ating shocks and an inner increase of temperature (via the 
pressure) within the collapsed region. In particular one can 
note how the infall velocity of the gas is strongly reduced as 
it enters the collapsed gas. Refinement levels were triggered 
as expected, providing a better resolution of the density peak 
and a smoother description of the phase space curve of DM 
in the innermost regions. Finally, direct comparisons with 
e.g. Teyssier (2002), shows that the results of EMMA are con¬ 
sistent with other codes, even at this latest stages of the 
pancake collapse. 
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Figure 9. Self Similar 3D Collapse of a Top-Hat density pertur¬ 
bation. Clockwise from top left: the radial profile of refinement 
level dark matter density , baryons radial velocity V and 

density at a=0.56. As in |Bertschinger| ( |1985| >, these quanti¬ 
ties are expressed in units of turnaround quantities. Dots are for 
simulation results and lines stand for the analytic prediction of 
|Bertschinger| ( p~985| >. Radii are in units of the box length. 


6.3 3D Gravity + Hydrodynamics: Bertschinger’s 
Self Similar Infall 


This experiment aims at reproducing the calculation made 
by Bertschinger (1985) where a top-hat overdensity within 
an expanding Einstein-De Sitter Universe (Drn = 1) col¬ 
lapses toward a scale-invariant density distribution in a 
self-similar fashion. Provided that radii are rescaled to the 
turnaround radius A, this self-similarity can be fully pre¬ 
dicted by analytic means. In the original paper, several con¬ 
figuration are explored and here we focus on the evolution of 
an overdensity that includes baryons in a dark-matter dom¬ 
inated potential. Compared to the test described in |6.2| the 
situation here is three-dimensional with a spherical symme¬ 
try. 


In practice, we generated a regular lattice of DM 128^ 
particles starting at z = 1000 in a 1 Mpc box, with cosmo¬ 
logical parameters Drn = 1, = 0.01, Hq = 70 km/s/Mpc. 

Two types of DM particles co-exist : particles within a ra¬ 
dius Ri = 0.05 Mpc to the center were assigned a greater 
mass (equals to 7.89 x lO^M©) than particles at larger dis¬ 
tance (with a 6.37 x 10^M© mass), in order to produce a 
central overdensity Si = 0.2. Also an arbitrarily cold and 
motionless gas has been sampled on a coarse grid of 128^ 
(i.e. ^ = 7), with the same central overdensity as dark mat¬ 
ter : cells within a 0.05 Mpc radius were assigned a 797 
Mq mass and ones at greater radii were given a 604 M© 
mass. Mesh refinement triggers when the mass within a cell 
is greater than 8 times the mass of a low-mass DM parti¬ 
cle : this criteria is similar to the one used in cosmological 
simulations in order to provide a quasi-Lagrangian strategy. 

Fig-i shows the radial profiles of the DM and baryon 
densities as well as the baryon radial velocity, for the simu- 
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lation and compared to the fits of the analytic solution pro¬ 


vided by Bertschinger (1985). Also shown is the spherical 


average of the refinement level. As in Bertschinger (1985), 
the DM density Ddm, the baryonic density Dh and the bary- 
onic radial velocity V are expressed in units of ‘turnaround’ 
values : 


Ddm — 

Pdm 


(63) 


Pta 


Db = 

Ph 


(64) 


P6,ta 


U = 

Vb 

VtA 


(65) 

where pta 

= ^ and /56,ta 


Bertschinger 




( 66 ) 


where ti^ Si, and Ri are respectively the initial time, the 
initial overdensity and the initial overdensity radius, from 
which the expression of the associated velocity can be ob¬ 
tained : 


Vta=^. (67) 

All the results are shown for a = 0.56. Clearly we manage to 
reproduce the analytic solution in particular the predicted 
inner logarithmic slope of —9/4 for the DM density profile 
or the shock positions. A small amount of diffusion can be 
seen in the innermost regions in the baryon density and the 
velocity jump from the Hubble flow to the shocked region is 
not as sharp as the predicted one, but the overall features are 
well reproduced by our calculations and of the same quality 
described for ART or RAMSES. 


6.4 3D radiative hydrodynamics: Growth of an 
HII region 

This test consists of a corner source, powered by a 100 000 
K black-body that send photons in a surrounding homoge¬ 
neous hydrogen-only medium. It belongs to the suite pro¬ 


posed by Iliev et al. (2006 2009) and comes in two different 


versions. The first version deals with a static and uncoupled 


gas (’Test 2’, Iliev et al. (2006)): we obtained a very good 
agreement with EMMA (not shown here), which does not come 
as a surprise since this test was also successfully passed by 
ATON or RAMSES-RT, that share a great number of details with 
EMMA. Here we focus on the second version, a coupled test 
known as ’Test 5’. The UV photons ionize and heat up the 
gas, leading to the creation of ionization fronts that propa¬ 
gate inside-out and putting the gas into motion. This test 
couples radiative transfer and hydrodynamics and has been 


performed by a whole serie of codes in Iliev et al. (2009). 


A 100 000 K Black-body is located at the corner 
(x=y=z=0) of an 15 kpc box, emitting 5 x 10^® UV photons 
per second. We sample the frequencies with the 3 groups 
of photons given in Sec |2.2.3[ The surrounding gas has an 
homogeneous number density of 0.001 hydrogen atoms per 
cm^. The calculation is run on a 64^ coarse grid {i = 6) 
and allowing for an additional level of refinement (128^ 


= 7) to comply with the resolution requirements of Iliev| 
et al. (2009). The refinement is simply triggered for cells with 


ionized fraction 0.01 < x < 0.8 : with the two-cells layer of 


Log10(density[cm“3]) 



x/Lbox 



Figure 10. Expansion of an HII region. Top: the loglO of density 
map (in cm“^), Bottom: the AMR grid used for the computation. 
The coordinates are expressed in units of the box length which has 
a physical extent of 15 kpc. The source is located at the bottom 
left corner and has been ignited 200 Myrs ago. 


neighbours being also refined, it provides a simple manner 
to track the ionization front. Boundary conditions are reflec¬ 
tive for boundaries adjacent to the source and transmissive 
otherwise. 

Fig. |10| shows the baryon density in the z = 0 plane, as 
well as the AMR structure that tracks the ionisation front 
for t=200 Myr. The source being in the bottom left corner, 
the distant cells remain at the base resolution as the front 
has not yet progressed to these regions. The cells close to 
the source are also at the base resolution : this region has 
returned to low resolution as it does not contain an ioniza¬ 
tion fraction that satisfies the refinement criterion (chosen 
to track front-like features). 

Finally between these two ensembles of coarse cells, one 
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Figure 11. Expansion of an HII region. Clockwise from top left: 
the gas density,the temperature, the Mach number and the ion¬ 
ized/neutral fraction. The coarse resolution is ^ = 6 and refine¬ 
ment is triggered on the ionization front to ^ = 7 in accordance 
with the Iliev et al. test 5. Red dashed lines stand for the ra¬ 
dial average taken 200 Myrs after the source has been ignited. 
Blue lines stand for the same calculation, but performed with the 
CRTA approximation. 



Figure 12. Expansion of an HII region. Front propagation pre¬ 
dicted by the standard AMR RT implementation (red line) and 
the CRTA approximation (dots). Top : the front velocity. Bottom: 
the front position. 


can find the refined region at 128^ resolution, tracking the 
front. On the other panel, the number density field of hydro¬ 
gen is also shown, presenting a typical double peak structure 
encompassing a void created by the energy injection by the 
corner source. 

Fig. m shows the radial profile of the density at the 
same instant (i.e. 200 Myrs after the source ignition) as 
well as the temperature, ionisation/neutral fraction and the 
Mach number profiles. EMMA recovers the typical features al¬ 
ready obtained by most of the codes of Iliev et al. 2009. The 
double peak is due to the presence of high energy photons 
in the spectrum of the 10^ black body: these photons with 
large mean free path can deposit energy behind the ionisa¬ 
tion front, while lower energy photons deposit their energy 
at the base of the front. The input of energy at larger radii 
is also at the origin of the moderate temperature increase 
seen before its drop in the neutral region. The effect of hard 
photons can also be seen in the extent of the ionized fraction 
drop at the front, that would be much sharper in the case of 
a monochromatic incoming flux. Overall, the fact that EMMA 
reproduces these specific features seen by all other codes in¬ 
dicates that the coupling between radiation and matter and 
the handling of multi-frequency transfer is consistent with 
others implementations. 

Regarding the CRTA approximation, Fig. pT] andalso 
provides the different fields profiles as well as a comparison 
of the temporal evolution of the front position and velocity 
(the front being defined as having a 50% ionised fraction) 
in both cases. Clearly the CRTA approximation provides 
the same results as the standard calculation : in Fig. pTjthe 
radial profiles taken at t = 200 Myrs are almost indistin¬ 
guishable and in Fig. the front position and velocities 
of the CRTA calculation follow the ones obtained from the 
standard procedure. The evolution is smooth enough both 
spatially (with features sampled on ~ 15 fine cells or 7 coarse 
cells) and temporally (with terminal velocities as small as 
0.1% box lengths per Myr) to ensure a good convergence of 
the CRTA toward the standard case. 


6.5 3D radiative hydrodynamics: 

Photo-Evaporation of a dense clump 


This situation has also been suggested by Iliev et al. (2009) 


and consists of a dense cold clump irradiated by a planar 
UV front. As the ionization front encounters the cloud, the 
high density will slow down its progression, acting as a trap 
on the incoming photons. As a side effect, a shadow will also 
be cast in the trail region of the clump. Finally, the energy 
deposited in the clump will put the gas in motion, leading to 
a photo-evaporation process by the incoming photon flux. 

The setup is given by Iliev et al] ( |2009| : a spherical 
clump of radius 0.8 kpc is centered on (5.4, 3.3, 3.3) kpc 
inside a 6.6 kpc box. Outside the clump, the gas has a 8000 
K temperature with a density of 200 atoms/m^. The clump 
itself has density of 40 000 atoms/m^ and a temperature 
of 100 K. A UV flux with a 100 OOOK black body spec¬ 
trum is incoming from the x=0 boundary at a rate of 10^° 
photons/m^/s. In practice the simulation is performed on 
a 64^ grid (i = 6) with one additional refinement level to 
comply with the [Iliev et al. (2009) recommended resolution. 
Mesh refinement is triggered for cells with a density greater 
than the background density. The x=0 boundary is a source 
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Figure 13. Photo-Evaporation of a dense clump. Top: loglO of 
the neutral fraction along the plane of symmetry of the clump at 
t=10 Myrs, as predicted by the standard AMR RT implementa¬ 
tion (top) and the CRTA approximation (middle), both assuming 
an ^ = 6 base level + 1 level of refinement. The bottom row is 
the prediction of the CRTA approximation on a static £ = 7 grid. 
Bottom: the same quantities at t=35 Myrs. Blue stands as ionized 
and red as neutral. 
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Figure 14. Same as in Fig.^jbut for the gas density. Underdense 
regions are blue and dense regions are red. 


of flux of the required rate, whereas the x=6.6 kpc boundary 
is transmissive. Boundaries in the two other directions are 
periodic. 

Fig, and show maps of the neutral fraction and 
gas density at t=10 and 35 Myrs. In each figure the top and 
middle row were obtained from 64^ simulations with an ad¬ 
ditional level of refinement, the top being obtained from the 
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Figure 15. Photo-evaporation of a dense clump. Position (top) 
and velocity (bottom) of the ionization front along the x axis as 
a function of time. The solid line stands for the standard AMR 
RT implementation results and blue crosses stand for the CRTA 
approximation results. Both were obtained using an ^ = 6 base 
level + 1 level of refinement. Red dots stand for the CRTA ap¬ 
proximation results with a. i = 7 static grid. 


standard RT implementation on AMR and the middle panel 
being obtained from CRTA approximation simulations. 


Globally a simple comparison to Iliev et al. (2009) re¬ 


sults demonstrate their consistency. In particular, the evap¬ 
oration is made obvious with the expansion of the cloud 
limits due to the energy injected by the UV front. However 
clear differences can also be noted : first the shadow behind 
the clump, albeit existent, is much weaker than in other 
radiative transfer codes. This does not come as a surprise 
since EMMA implements the GLF flux to compute intercell 
exchanges and is known to be very diffusive. It therefore 
prevents the creation of clear cut shadows as a diffuse com¬ 
ponent of the flux eats the neutral gas in direction orthog¬ 
onal to the incoming direction of UV photons. The same 
effect was already noted for ATON ( Aubert fc Teyssier|2QQ8 ). 
Second, at late time, the contours of the extended cloud are 
not as spherical as expected and present significant fluctua¬ 
tions around a mean radius. These fluctuations are artifacts 
of the cloud initial sampling on the coarse i = 6 grid. The 
same artifact can be seen e.g. in the FLASH-HG results in 


Iliev et ah (2009) and also linked to the initial conditions. 


Gomparing the standard RT and the GRTA approximation, 
it can be seen that the latter provides a faster ionization 
of the clump. Looking at the Fig. the shadowed neutral 
clumps are systematically smaller in the GRTA regime. It 
does not come as a surprise since the RT is performed at 
the base level only, which increases artificially the extent of 
the UV flux penetration into the clump and also increases 
the scheme diffusion. Fig. |15| provides a more quantitative 
insight on this aspect, describing the front progression in¬ 


side the cold clump and its velocity. The front position is 
defined as corresponding with the position of the cell hav¬ 
ing X = 0.5. At early stages (t < 7.5 Myr), the GRTA 
and standard description produce an identical front prop¬ 
agation. However it can be noted that the GRTA presents 
a step-like progression due to the coarser resolution of the 
radiative transfer which translates into a coarser resolution 
of the ionized front. Later on, the front is pushed back by 
the expanding cloud in both descriptions (as can be seen 
from the recessing velocities), but it happens earlier in the 
GRTA case. Finally the front cannot be tracked for t > 38 
Myrs, as no cells with a neutral fraction greater than 0.5 
can be found anymore. Let us mention that comparisons of 
the standard calculations with the results presented in |Iliev| 
et ah (2009) confirms the capacity of EMMA to track correctly 
the front propagation within the clump. In particular, EMMA 
recovers as the other codes the phase where the front is 
pushed back by the expanding cloud, when t ~ 35 Myrs. 
Globally a faster photo-evaporation of the cloud can be de¬ 
tected in the GRTA approximation compare to the standard 
AMR description, essentially due to the coarse description 
of the radiative fields. 

Finally, we present in the lower panels of Fig. and 
[^the results of a GRTA calculation on a static i = 7 grid. 
It allows us to probe the separate the effects of incomplete 
coupling between dynamics and radiative transfer and of 
the coarse description of radiation. In this experiment, the 
GRTA is equivalent to a radiative post-processing of the dy¬ 
namics but performed on the fly, at the temporal scale of 
dynamical times and without any impact of a coarsened ra¬ 
diative transfer. Gompared to the standard treatment, radia¬ 
tive transfer is subcycled with respect to dynamics, leading 
to a large number of radiative transfer+thermochemistry 
calls per single gravity and hydro calculation. Glearly the 
GRTA greatly reproduces the standard calculation in this 
case, both in the neutral fraction and density maps and in 
the propagation of the fronts. It confirms that the coarsened 
resolution is indeed the reason for an accelerated photo¬ 
evaporation of the clump and that the radiation subcy¬ 
cling does not induce significant deviation from the standard 
treatment. 


6.6 Cosmological runs 

6.6.1 Preliminary Reionization Simulations 

Finally, we present the results of full simulations of cosmo¬ 
logical reionization. The focus is put on hydrodynamical 
simulations with radiative transport and on reionization- 
related quantities but additional tests on the dark matter 
haloes mass function or the energy conservation are pre¬ 
sented in appendix We produced a set of 4 simulations 
with 4 different specific emissivities for the sources. Each 
simulation consists in a 4 Mpc/h box sampled with 128^ 
base resolution cells and 128^ dark matter particles. These 
simulations will be referred as VO.3, VI, V3 and V30. The 
V1 simulation is a fiducial run with sources emissivities that 
produce a reasonable ionization history. The three additional 
cases uses stars with boosted or depleted specific emissivi¬ 
ties by the corresponding factor, VO.3 and V30 standing 
respectively for the dimmest and brightest source model. 


Initial conditions were produced using Mpgrafic (Prunet 
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et al. 2008) with a Planck cosmology (Planck Collabora- 
tion et al.||2013| ) (Qm = 0.315, Qa = 0.685, Qj, = 0.049, = 


0.96, i^o = 67 km/s/Mpc) starting at z = 80. Each DM 
particle weights 4 • lO^M©. AMR is triggered using a quasi- 
Lagrangian strategy and a cell is refined if it contains more 
than 8 DM particles. Radiative transfer is run with 3 groups 
of frequencies([13.6,24.6] eV, [24.6,54.4] eV and [54.4,1000] 
eV, dictated by the ionization thresholds of hydrogen and 
helium). 

In addition to hydrodynamics and radiative transfer, 
we had to implement a simple star formation recipe in or¬ 
der to populate the simulated volume with the ionizing 
sources that drive the reionization process. This star for¬ 
mation model is briefly described here and will be the sub¬ 
ject of a dedicated paper in the near future: its is widely 

(2002); Rasera & 


inspired by Katz et al. (1996); Kay et al. 

[Teyssier (2006); Dubois Teyssier| (2008). A cell is said to 
be prone to star formation if either its gas comoving density 
(n*) or its gas density contrast ((5*) are greater than user-set 
thresholds. Once a cell is flagged to form stars, the number 
of stellar particles to be created is drawn from a Poisson law 
with the A parameter given by: 


A = e 


rUcell At 
TTl£ 


( 68 ) 


A corresponds to the average number of stars created during 
a time step At within a cell that contains a mass of gas given 
by TTiceii • The star formation process is controlled by a typical 
star formation time scale t* and an efficiency parameter e. 
The mass of a stellar particle is given by depends on the 
level of the cell and is equal to 


rrii = pbS^Ax'^. 


(69) 


The following results were obtained with (5* = 150, and 
t*/e = 2Gyrs. This values would be considered as ’stan¬ 
dard’ even though we won’t discuss them here and we will 
explore thoroughly the results dependence on these values 
in a forthcoming paper.As shown hereafter they neverthe¬ 
less lead to a star formation and a reionization process in 
reasonable agreement with constraints. Each stellar particle 
emits photons for 20 Myrs, with a constant emissivity and 


assuming a 50 000 K black body spectrum (see Baek et al. 


(2009)). In practice the source emissivity has been tuned by 
trial and error to produce a reasonable reionization history, 
complete at z ~ 6 and for the fiducial XI model, it results 
in an emissivity of 1.5 x 10^^ ionizing photons/sec/stellar kg. 
Taking the calculation of Baek et al. I pOO^ as a reference, 
which assumes a Salpeter IME and 1 —lOOM© mass range, it 
corresponds to a 15% escape fraction. Again, X0.3, X3 and 
X30 use emissivities multiplied by the corresponding factor, 
the X30 model being clearly over-powered and merely used 
to probe the qualitative behaviour of the code in the regime 
of strong radiation. 

At the current stage we restrict ourselves to this sim¬ 
ple model that obviously lacks important ingredients. Eor 
instance SN/AGN feedback has not been implemented yet, 
chemistry is limited to the simple hydrogen and no modifi¬ 
cation of equation of state is assumed at very high densities. 
As a consequence the star formation rate is essentially not 
regulated in this cosmological toy model. Hence the follow¬ 
ing results could not be considered as definitive regarding 


what the code could do but should rather be seen as tests 
on experimental configurations close to production runs. 

Eig. and present the distribution of matter, AMR 
levels, temperature and hydrogen ionized fraction in a 320 
kpc/h thick slab of the XI run at z = 6.8. Glearly, the 
matter on these scales is already highly structured at z = 
6.8, with regions having a density contrast greater than 1000. 
These regions are effectively tracked by the AMR grid and 
the overall distribution of high resolution grids follow the 
main features of the filamentary structure in this simulated 
volume. Sources are created in this overdensities and their 
radiation leads to large HII regions. One can note how the 
fronts are locally prevented to progress into the IGM by 
filaments and dense clumps, leading to complex features in 
their geometry. It can also be seen that the ionization fronts 
present a certain extent induced by the larger mean free 
path of high energy photons. It also leads to a preheating of 
the gas, behind the ionization fronts, to temperatures close 
to a few thousands K. Within the ionized regions, a quasi 
homogeneous temperature close to 10000 K is set by the UV 
radiation with local fluctuations correlated with the density 
field. Some shock-heated gas (located at ~ [2,1.7] Mpc/h) 
with temperatures greater than 100 000 Kelvins can also be 
seen. 

The fiducial model XI presents a reasonable reioniza¬ 
tion history and SER, in broad agreement with observational 


constraints (see Eig. 18). Gompared to Ean et al. (2006), the 
ionization happens slightly earlier than observed and cor¬ 
respondingly the photoionization rate at z ~ 6 is overesti¬ 


mated when compared to Galverley et al. (2011). The cosmic 
star formation history is also in excess compared to the ob- 


servationnally deduced rates given by Bouwens et al. (2014). 


This fiducial model could have benefited from a slightly im¬ 
proved calibration to reproduce the observed data points, 
however we consider that the current level of agreement is 
good enough at this stage : let ns recall for instance that 
these simulations lack SN feedback and the small simu¬ 
lated volume could also be inadequate to make quantitative 
prediction on cosmic averaged quantities. At this stage we 
merely aim at looking for qualitative and not quantitative 
clues of the impact of radiation within a cosmological set¬ 
tings. 

These clues can be obtained by comparing this fiducial 
simulation with the 3 other models AO.3, A3 and A30. As 
expected these models result in different reionization histo¬ 
ries that are in place at earlier (resp. later) times for larger 
(resp. lower) emissivities (see Eig. p^ . The SER however 
remains essentially unaffected by the change in emissivity, 
except at later times (z > 10) for the models with the bright¬ 
est sources (A30), leading to a depleted star formation rate. 

Eig. presents the baryon fraction and the instan¬ 
taneous SER measured in the dark matter halos found at 
^ = 5.5 in the different models. At this redshift, the reion¬ 
ization is well advanced in most models except in A0.3 
where an 75% ionization level is only achieved. Halos have 


been detected using the HOP halo finder (Eisenstein A Hu 


(1998)) and baryons are counted within R 200 i-e. the radius 
of the spherical region around each halo with an average 
density 200 times greater than the cosmic average matter 
density. ~ 450 halos with a mass greater than 10^h~^M q 
(corresponding to 45 particles) are found. Glearly a signifi¬ 
cant scatter can be found in the distribution of the baryon 
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Figure 16. Structuration of matter in a cosmological radiative 
transfer run of a comoving 4 Mpc/li-128^ box, taken at 2 : = 6.8. 
Top: the baryon overdensity map. Bottom: the AMR levels. The 
shown region has a thickness of 320 comoving kpc/h. 
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Figure 17. Same experiment and region as in Fig. |16| Top: the 
temperature map. Bottom: the hydrogen ionized fraction map. 


fraction but general trends can nevertheless be observed in 
the data : halos with a mass greater than 10 ^ Mq basically 
present a universal fraction whereas lighter objects are more 
dark matter dominated as expected. A comparison of the 
hducial model distribution to the ht provided by [Okamoto] 


et al. (2008) shows a reasonable agreement with a correct 
transition mass at M ~ 3 • 10® M©, even though a signif- 
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redshift z 


icant scatter is obtained. If all the models are considered, 
a clear trend can be noted : the dimmest models (X0.3, 
XI) share the same global behavior (or qualitative func¬ 
tional form) even though the fiducial model presents baryon 
poorer low mass halos. Meanwhile, the brightest models can 
produce baryon fractions 10 times smaller than the fiducial 
case with a different functional relation between the baryon 
fraction and the halo mass. The impact of radiation on this 
quantity seems therefore well established in this series of 
models, where brighter sources have a strong impact on the 
gas within shallow potentials. 


Interestingly, this impact does not directly translate 
into a modified SFR inside the halos (see Fig. 20). Again 
the scatter is quite important and finite mass effects can be 
seen in halos with small formation rates and the interpre¬ 
tation can therefore be difficult. Still, it appears that the 3 
dimmest models (AO.3, the fiducial XI and A3) are not sig¬ 
nificantly different and present the same mass dependence 
of the star formation rate within their halos. Since we found 
that the global baryon quantity is indeed affected, it seems 
to imply that the star forming baryons are unaffected by 
the source emissivity and the presence of radiation. Only 
the most extreme case of source emission shows a significant 
dip in the SFR of low mass halos : in our simplistic model 
of star formation, a certain level of gas depletion must be 
achieved to impact the production of stellar particles. 


As a final note, we present in Fig. [^the halo baryon 
fraction in the 4 models at the same cosmic average ioniza¬ 
tion fraction, x — 0.75. Of course this level of ionization is 
achieved at high redshift (~ 10) for the brightest model and 
corresponds to the last snapshot at z = 5.5 for the dimmest 
one. It can be easily seen that the baryon fraction mass dis¬ 
tribution is essentially identical in the 4 models, taken at 4 
different redshifts but at the same ionization level. It could 
hint that an essential ingredient of the baryon depletion is 
not only the source intensity but also the exposition duration 
to the UV background. In the previous analysis at z = 5.5 
not only the brightest model contains the brightest sources 
but it also provided the longest duration over which halos 
are in an optically thin Universe since such models provide 
an early reionization. Conversely, dimmer models produce a 
late reionization and therefore a shorter exposition duration 
to the UV flux in a transparent Universe. It could impact 
the baryon fraction in low mass halo measured at a given 
time. At a given average ionization fraction, we somehow 
get rid of the scatter in flux exposition and look at halos 
from different simulations at a similar stage of their ’Uni¬ 
verse’ ionization history, with a similar structure for the UV 
field. And indeed, in our model, it significantly reduces the 
differences observed previously. 


Figure 18. Global evolutions of the average volume weighted 
neutral fraction (top), photoionization rate (middle) and star 
formation history (bottom) in 4 Mpc-128^ reionization simula¬ 
tions. The fiducial model is shown in solid-red (labeled as hav¬ 
ing an Al emmissivity) while simulations with different emissiv- 
ities are shown in dashed-blue (with an emissivity equals to 30% 
the fiducial one, AO.3), dotted-black (A3) and dash-dotted-green 
(A30). Observational constraints from [Fan et al.| ( |2006| > (Top 
panel, squares), Calverley et al. ]i |2oTTt (middle panel, points) and 
[Bouwens et al.| ( 2014| > (bottom panel, blue shaded area) are also 
given. 


Let us recall that several important ingredients are 
missing in our models like the inclusion of supernovae feed¬ 
back, which may enhance the SFR suppression in low-mass 
halos or the presence of H2 , which fraction can greatly differ 
from the baryon fraction. Hence the results presented in this 
section indicates in a qualitative manner that EMMA is able 
to handle cosmological reionization simulations. Further in¬ 
vestigations and implementations are necessary to quantita¬ 
tively assess the subjects discussed here. 
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z=5.5 


z=5.5 




Figure 19. Top: Baryon fraction in DM halos as a function of 
their mass at z=5.5 computed in the 4 models of emmisivities. 
Small dots stand for the values for each individual halo whereas 
large symbols stand for the average baryon fraction within a bin of 
halo mass. Bottom: the same quantity but for the fiducial model 
only (dots) compared to the |Okamoto et M^ ( |2Q08] > fit. 


7 PERFORMANCES 
7.1 Preamble 

As a closing chapter to this description of EMMA, we now 
discuss the performances of the code. As shown hereafter, 
the comparison of performances on different architectures 
is a complex matter as it depends on how architecture- 
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Figure 20. The instantaneous star formation rate per DM halo 
mass for the 4 emissivity models. Small dots stand for the val¬ 
ues for each individual halo whereas large symbols stand for the 
average baryon fraction within a bin of halo mass. 


xion=0.75 



Figure 21. The baryon fraction as a function of the halo mass for 
the 4 different models measured at the same ionization fraction 
X = 0.75. The corresponding redshifts are given in the labels. 
The red line stand for the [Okamoto et al.| ( |20Q8| > fit taken at the 
redshift of the fiducial model (^1). 


© 0000 RAS, MNRAS 000 , 000-000 












22 D. Aubert et al. 


dependent optimizations are implemented. Such a compar¬ 
ison also depends on the context it has been performed : 
as such it will evolve in time (as hardware improves for in¬ 
stance) or in ’space’ (from one computer to another one at 
a given time). We nevertheless think that the following sec¬ 
tions will shed some light on how the code behave and how 
these behavior can significantly vary depending on the com¬ 
pilers or the architecture. More generally it is also an op¬ 
portunity to demonstrate that codes performances should 
be carefully considered, and not only for EMMA. 

In the following sections, the calculations involved a 4 
Mpc/h - 128^ cosmological simulation with full physics and 
the same parameters and simple star formation recipe as 
the ones described in section |6.6.1| They used the CRTA 
approximation with a c/10 speed-of-light, which should not 
affect the discussion on raw performance and scaling issues. 
We compared 3 types of EMMA binaries on the Curie-CCRT 
supercomputer hybrid nodes, compiled using single precision 
arithmetic: 

• a GPU binary produced by the NVCC compiler from 
the CUBA 5.5 SDK with 02 optimization level, called gpu- 
02. This version runs the vectorized physical engines on 
M2090 Nvidia GPU devices while still relying on a single 
core to perform the other tasks, such as AMR logistics, vec- 
torization, particles operations, etc... In the case of a multi- 
GPU run, each MPI process is attached to a single GPU 
core associated to a single distinct GPU. 

• a GPU binary produced by GGG 4.4.7, called gcc-02 
hereafter. This version fully runs on 2.7 GHz Sandy bridge 
Westemere processors and uses the standard 02 optimiza¬ 
tion level. 

• a GPU binary produced by IGG 14.0.3. with the same 
02 optimization level, called icc-02 hereafter. Such EMMA 
binaries are usually faster than the ones provided by GGG by 
a factor close to 4 : this difference is essentially the result of 
optimizations on floating point operations that are enabled 
by default. Such optimizations can be disabled by setting an 
additional fp-model=strict flag and produce EMMA binaries 
with reduced performances at the level of the ones produced 
by gee (not shown here). 

IGG is available in most supercomputing and institutional 
facilities. GGG on the other hand is widely distributed and 
could be the only option on small configurations (e.g. on 
laptops, desktop machines or local shared memory calcu¬ 
lators). Since they produce binaries with different perfor¬ 
mances, the resulting GPU acceleration will also depend on 
the GPU-version taken as a reference. 

Gomparisons of GPUs and GPUs are done by consid¬ 
ering one graphics device against one GPU core. Obviously, 
it biases performances in favor of GPUs which are essen¬ 
tially parallel devices. Nevertheless, we argue that it is the 
simplest way to do the comparison, since a given GPU can 
be associated with a variety of different GPU nodes with 
different core numbers. However some care must be taken 
when considering acceleration rates. If an acceleration fac¬ 
tor of x80 is found, it should be seen as considerable since 
80 GPU cores per GPU is already a significant configura¬ 
tion and codes usually don’t follow strong scaling laws at 
such levels of acceleration (i.e. an x80 acceleration cannot 
be obtained using 80 cores). On the other hand if an x4 
acceleration factor is found, it should be considered as low 


since 4 cores are easily obtained and x4 strong scaling factors 
can usually be achieved. In the case of Gurie hybrid nodes, 
1 GPU is associated with 4 cores but other configurations 
exist (e.g. Titan-ORNL associates 16 cores with 1 GPU). 


7.2 Computing time consumption 

Fig presents the time spent by a calculation to reach 
a given expansion factor in a cosmological simulation. 
Whichever code version is considered, two major phases can 
be distinguished : for an expansion factor a = (1 + < 

0.065 the code achieves a stable regime with small and reg- 

! curves). At 
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ular time steps (given by the slope of the Fig. 
this stage, no source has been created yet and no light has 
to be propagated : the radiative engine (which also includes 
thermo-chemistry modules such as cooling processes) is not 
limited by the GFL condition and is called once per dynam¬ 
ical time step. Furthermore, non linearities are small and 
AMR has not been deployed yet, hence the work per coarse 
cell is naturally close to balance. At a ~ 0.065 the first 
source appears and radiative transport must be computed 
while satisfying the stringent GFL condition. The number of 
RT calls per dynamical time step increases to typical levels 
of 150 calls per step. In Fig the time spent increases by 
orders of magnitude with a much greater slope, i.e. a much 
greater time spent per time step. This contribution of RT 
to the computing time is further emphasized by the dashed 
blue line in Fig. which stands for the RT-only contribu¬ 
tion in the gpu-02 calculation (similar curves are obtained 
for the GPUs calculation albeit not shown here): clearly the 
dramatic increase in the computing time is driven by this 
specific module. 

In the same plot, solid lines stand for the computing 
time required for EMMA to reach a given expansion factor us¬ 
ing a M2090 Nvidia GPU Device with gpu-02 {black dashed 
line) and using a single GPU core with gcc-02 (black solid 
line) and icc-02 (black dotted line) binaries. Gomparing 
these different versions of EMMA, it can be seen in Fig. |22| that 
for the gpu-02 version a = 0.07 is achieved in 50 minutes, 
whereas 16 hours are required for the gcc-02 version, pro¬ 
viding an xl6.9 acceleration factor. In the pre-source regime 
(for a < 0.065) this acceleration factor drops to x6.4: in this 
regime the contribution of the radiative transfer engine is 
much smaller and so is the level of potential acceleration. 
In the very first stages of the calculation this acceleration 
rate even drops further (to factor close to x4) as the cooling 
induced by dynamical effects is small and hence the need 
for associated calculations that could have benefited from 
hardware acceleration. If the GPU version is compared to 
the icc-02 run, the maximal acceleration rate of the gpu-02 
code drops to x3.9. Glearly the removal of strict value-safe 
floating point operations (which allows greater optimization 
from the GPU compiler) results in a more competitive GPU 
code performance-wise. Moreover, it should be noted that 
the current comparisons deal with a GPU device against 
a single GPU core as we argued that it provides the sim¬ 
plest mean of comparison. However cores are usually part 
of multi-core nodes, connected to one or two GPU devices. 
Hence an acceleration rate of a few can be seen as not suf¬ 
ficient if it does not exceed the core per GPU ratio. For 
instance we show in Fig.j^the time required for the 4 cores 
of hybrid Gurie node to run the same test (symbols). As can 
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expansion factor 

Figure 22. Comparison of the cumulative time spent to reach 
a given expansion factor for a 4 Mpc/h-128^ cosmological simu¬ 
lation of the reionization. Times are given for a single comput¬ 
ing device (i.e. 1 GPU or 1 CPU core). The thick black dashed 
line stands for the GPU run performed on a M2090 Nvidia GPU 
whereas the thin dashed blue line stands for the contribution of 
radiative transfer to this cost. The black solid (resp. dotted) line 
stands for a single CPU core (2.7 GHz Sandybridge Westemere) 
using the gcc-02 (resp. icc-02) binary. The symbols stand for a 
4-core CPU calculation using icc-02 on a Curie node. 

be seen here, the strong scaling behaviour of EMMA is suffi¬ 
cient to further improve the CPU-consumption by a factor 
of almost 4 and the parallel icc-02 binary slightly outper¬ 
forms the gpu-02 performances. As described in the next 
subsection, these diverse performances result from the im¬ 
pact of CPU optimization and GPU acceleration that are 
not uniformly distributed among the different modules. 

7.3 Detailed computing cost breakdown 

For the same experiments. Fig. |23| presents the cumulative 
time spent in the 3 principal modules (Poisson, Hydro and 
Radiative solver) of EMMA as a function of the number of 
successive calls made to these modules. Solid lines stand for 
the single core CPU calculation obtained with gcc-02^ dot¬ 
ted lines stand for single core CPU calculations produced by 
icc-02 and dashed lines stand for the GPU-driven experi¬ 
ments. . 

Focusing hrst on the gcc-02 results, it appears clearly 
that hydro and RT calculations dominate the overall time 
budget of EMMA. Unsurprisingly, the Poisson solver only con¬ 
tributes marginally to the overall cost : hrst, the amount of 
calculation involved in this stage is small compared to the 
complex hydrodynamical solvers or thermo-chemistry cal¬ 
culations. Second, it relies on an iterative solver, where the 
solution does not evolve quickly from a time step to another 
or only in a few hyper-rehned cells, ensuring a rapid con¬ 
vergence and hence a low computational cost. It can also 


HYD VT 

HYD Cal 

RT VT 

RT Cal 

icc-02 1.2 

26.4 

0.9 

I3.I 

gpu-02 1.76 

0.69 

1.66 

0.85 

Table 1. Typical time 

spent (in 

seconds) 

in the vectoriza- 


tion+transfer operations (VT) and in calculations (Cal) for the 
hydrodynamics (HYD) and radiative transfer modules (RT). 
Times are given for time step #10 of the benchmark simulation 
described in Sec. corresponding to a regime without sources 
and without AMR. 


be noted that the hydrodynamics are the dominant stage at 
early times, being overtaken by RT only as thermo-chemical 
computations start to contribute and obviously at later time 
when the CRTA approximation execute ~ 150 RT calls per 
hydro call. This effect due to CRTA is also evident in the 
number of RT calls which is much greater than the identical 
number of hydro and Poisson calls. 

Looking at the performance of the GPU -driven binaries 
gpu-02^ the time spent in the hydrodynamics and the RT 
is reduced to the levels of the Poisson solver : compared to 
gcc-02 the RT module is accelerated by a factor x32 and the 
hydrodynamics by a factor xl4 . Interestingly we could not 
achieve any acceleration with the Poisson solver on GPU 
architecture. The reason is the poor computation/transfer 
ratio for the Poisson solver: our measurements show that 
gathering the data from the AMR to vector-like structure 
on CPU takes ~ 75% of the time required by the Poisson 
solver: the room for acceleration is therefore extremely small 
whereas in hydro and RT this gathering stage only repre¬ 
sents ~ 5 — 10% of the computation. In general, the acceler¬ 
ation can be efficient on computation-dominated modules, 
and in our implementation the Poisson iterative solver does 
not belong to this family of functions and represents there¬ 
fore an intrinsic limit to EMMA performances on GPUs. 

Finally, dotted lines show the cumulative time per calls 
of a given module but using icc-02. No differences can be 
noted for the hydro and Poisson solver compared to the tim¬ 
ings obtained from gcc-02^ but the time spent into the radia¬ 
tive transfer module is greatly reduced. It is easily explained 
by the important contribution of non trivial mathematical 
operations in cooling rates, cross-sections, ionization rates, 
etc... present in the thermo-chemistry operations handled by 
the RT module. Such operations have a clear beneht from 
the optimizations made by the compiler. This is not the case 
for hydrodynamics : even though a MUSCL scheme involves 
a great number of operations, they essentially rely on simple 
arithmetic operations, which are less prone to optimizations. 
For hydrodynamics, gpu-02 still provides a xl2 acceleration 
compared to icc-02 but RT acceleration rate drops to x5.5 : 
since it is the dominant process, it strongly affects the overall 
GPU acceleration. 

The current limiting factor of GPU performance is the 
cost of vectorization and data transfer to and from the de¬ 
vice. In fact, the near identical hoor performance obtained 
by the three GPU modules is due to the irreducible cost 
of these operations. In Tab.^ we list the time consumption 
for the vectorization-transfert stages as well as for the actual 
computations, measured in a typical early-stage step and for 
icc-02 and gpu-02 binaries. Poisson Solver results are not 
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Figure 23. Cumulative time spent in the Poisson (red), hydro 
(green) and radiative transfer (blue) physical engines as a function 
of number calls. Solid lines stand for single core CPU calculations 
on a 4Mpc/128^ simulation using gcc-02^ dotted lines stand for 
single CPU core calculations using icc-02 and dashed line for 
calculations driven by a single M2090 GPU using gpu-02. 

discussed as they are already dominated by vectorization on 
CPU. We find that for the hydrodynamics and RT the cost 
of these operations is close to 70% on GPUs (see Tab. 
whereas they contribute to less than 10% of the CPU cal¬ 
culation : the acceleration potential of calculations is thus 
almost fully exhausted. It can also be noted from Tab. [^that 
these vectorization/transfer steps are actually more expen¬ 
sive on GPU, because of the additional transfer of data from 
the CPU host to the GPU : the cost of transfer is broadly 
equivalent to the vectorization. It doubles the time spent 
in non-calculations operations which end up dominating the 
cost of the hydrodynamical and radiative transfer modules. 

7.4 Parallel scaling 

Fig. shows the scaling properties for EMMA, where a fix 
load per process is chosen and the number of process is 
increased, thus increasing the volume and total number of 
coarse cells or particles handled by the code. For the gpu- 
02 and gcc-02 scaling measurements, we stick to a 4Mpc - 
128x128x128 coarse cells per process load, similar to the one 
used above. For the icc-02 Intel CPU scaling we varied the 
load per core and used non cubical sub-domains. Configura¬ 
tions from I to 256 GPUs and from I to 2048 cores have been 
used. Times were measured in the initial stages of a cosmo¬ 
logical run, during the first 20 steps. At these early stages, 
non linearities are weak and AMR is not triggered: load im¬ 
balance is minimal and therefore allows a better estimation 
of parallelism-induced deviations. It also corresponds to the 
epoch where the acceleration is the weakest but it should not 
affect our conclusions regarding the scaling abilities of the 
code. Clearly the (weak) scaling trends are satisfying with 
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Figure 24. Empty symbols: scaling curves for different loads 
per process configurations using the icc-02 binary, given as the 
CPU hours per time step as a function of the total number of 
coarse cells. The black solid line represents the perfect CPU scal¬ 
ing (i.e. t ~ with a = 1) while the black dashed line stands 
for the worst scaling law obtained here with a = 1.1. Filled black 
symbols stand for the same measurements but made with the 
gcc-02 binary and a constant load of 128^ per process. Filled 
blue symbols stand for the same measurements on GPUs (using 
gpu-02 binaries) and a constant load of 128^ per process. The 
blue solid line stands for the perfect scaling expected for GPU 
whereas the dashed blue one stands for the one actually mea¬ 
sured with a = 1.05. 

a CPU/GPU computing time that scales as t r\j ivti-i-i] 
where N is the number of cells. In fact on CPU the scaling 
is almost perfect in configurations with a 128x128x128 load 
per process and drifts away for smaller problems per process 
: it is a standard strong scaling issue, where a smaller local 
load increases the weight of parallelism overheads. Further¬ 
more, as the sub-domains become non-cubical the number 
of neighbors and therefore the amount of communication 
varies from one process to another if a Peano-Hilbert seg¬ 
mentation is used, leading to a small unbalance of communi¬ 
cations between processes. Nevertheless, the scaling of EMMA 
running on multiple CPUs and GPUs seems satisfying, aside 
from load-balance issues that will inevitably arise later on 
as structures will emerge. We plan to implement balancing 
procedures in forthcoming developments. 

7.5 Discussion 

Overall, the performance achieved by GPU driven calcula¬ 
tions is promising and acceleration rates greater up to xl6.9 
can be obtained, but such rates should be discussed as per¬ 
formance comparison is a complex matter. First the large ac¬ 
celeration rates are obtained in the regime where the CRTA 
is used and effective (i.e. after the first star has appeared): 
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the CRTA is somewhat designed to favor hardware accelera¬ 
tors as it increases the weight of pure and heavy calculation 
on the overall budget of EMMA. Furthermore it remains an 
approximation with a coarse description of radiative trans¬ 
port, hence it stands as a lower order approximation com¬ 
pared to the standard AMR coupling of radiation to matter. 
Note that even in the CRTA regime this standard coupling is 
naturally enforced in the pre-source stages, and we demon¬ 
strated that only moderate acceleration is achieved in this 
regime (x6.4). Second, we also demonstrated that properly 
optimized CPU binaries can be only a few times slower than 
GPUs. The overall code acceleration rate drops then to x3.9 
even in the CRTA dominated regime, as can be seen by ex¬ 
amining Fig. |22| Finally the performance gap can further be 
reduced by increasing the number of CPU cores used or by 
using full nodes capacities. 

It should be noted that these acceleration are global 
ones, i.e. they rely on global timings of EMMA where a sig¬ 
nificant number of modules remain to be ported on GPUs. 
This is for instance the case for particles-related operations 
which are currently only handled by the CPU. Even if they 
are sub-dominant, optimizing these tasks or porting them 
for GPU architecture could provide a moderate additional 
acceleration. 

However, the most obvious way to increase the GPU 
acceleration is to reduce the cost of vectorization and data 
transfer to the device. The data transfer bottleneck is ex¬ 
pected to evolve naturally as new standards are being devel¬ 
oped to increase the CPU to GPU bandwidth or by using 
architectures such as AMD’s Accelerated Processor Units 
(APUs) where CPU and GPU share the same memory, thus 
nullifying the cost of transfer. It should be noted that thanks 
to the vectorization strategy of EMMA, future ports on new ar¬ 
chitectures should be of limited complexity. Regarding the 
cost of the vectorization (driven by gather/scatter opera¬ 
tions), let us note that this operation is currently purely 
sequential and dealt with by the CPU. Gather/scatter oper¬ 
ations could in principle be parallelized to reduce its imprint 
on the overall costs and to lower the intrinsic floor that lim¬ 
its the GPU performance. Such parallelization is limited by 
concurrent memory accesses but current CPU architectures 
designed with Non Uniform Memory Access (NUMA) should 
in principle alleviate this issue. Another option would be to 
deport the vectorization on the parallel computing device 
( a GPU in our case) : performance gains are expected to 
be limited, since gather/scatter operations rely on non ‘gpu- 
friendly’ tree-walk operations, but even a weak acceleration 
of the vectorization process would provide a welcome boost 
to the calculation acceleration. 

The tests presented here were made on CPU and GPU 
architectures of the same generation (M2090+2.7 GHz West- 
emere on Curie), but newer hardware is and will be available 
and similar tests will be necessary to reassess the results 
shown here. However preliminary tests on more recent de¬ 
vices show no significant improvement in performance (see 
appendix [b| . It does not come as a complete surprise since 
our GPU calculations are currently limited by vectoriza- 
tion+transfer and the more recent hardware does not pro¬ 
vide significant progresses on these aspects. 


^ http://www.nvidia.com/object/nvlink.html 


On a broader perspective, the results presented here 
seems to make a strong case for a full usage of hy¬ 
brid installations, where EMMA would distribute its differ¬ 
ent tasks/modules simultaneously on the different types of 
hardware (multicore CPU, GPU) available on a node. For 
instance gather/scatter operation on a CPU represents typ¬ 
ically 5-10% of the time spent in a physics engine, hence 
there is room for potential acceleration on a multi-core node 
through OpenMP directives, probably of a factor of a few, 
making it competitive with current GPU accelerations. For 
instance, multiple tasks could be done in parallel such as e.g. 
thermo-chemistry on multi-core CPU and radiative trans¬ 
port on GPU and current performances could be increased 
by an overall factor of 2. Of course, these estimates need to 
be confirmed by experiences. 


8 CONCLUSIONS 

EMMA is a cosmological simulation code which handles simul¬ 
taneously gravity, hydrodynamics and radiative transfer on 
an adaptive grid that can be refined on the fly (AMR). Writ¬ 
ten in C, this code is parallel (via the MPI protocol) and can 
deploy its physics modules on graphics processing units us¬ 
ing CUBA. Designed for the study of the reionization epoch, 
EMMA is nevertheless a versatile code for structure formation. 

The code passed a variety of test cases and can confi¬ 
dently produce accurate and relevant simulations. A first 
comparison of cosmological reionization simulations with 
different source parameters presents the expected qualita¬ 
tive behaviour of the physics at play. 

EMMA has been tested in a wide variety of parallel config¬ 
urations and it demonstrates satisfying scaling properties. It 
is able to use graphics processing units (GPUs) to accelerate 
hydrodynamics and radiative transfer calculations. Depend¬ 
ing on the optimizations and the compilers used to generate 
the GPU reference, global GPU acceleration factors between 
x3.9 and xl6.9 can be obtained. Vectorization and transfer 
operations currently prevent better GPU performances and 
we expect that future optimizations and hardware evolution 
will lead to greater accelerations. Overall we demonstrate 
that EMMA is able to cope efficiently with a variety of hard¬ 
ware. 

Aside from optimization to improve the code perfor¬ 
mance and GPU-driven acceleration factors, additional fea¬ 
tures will be included into EMMA in a near future. Among 
them, star formation and supernovae feedback is a major 
priority as it is an essential ingredient for galaxy forma¬ 
tion theories and models. Their implementation is on the 
way and will be the described in a forth coming paper. 
Molecular chemistry is envisioned too as it is physically rel¬ 
evant to understand the formation of the first and smallest 
objects during the reionization epoch (like mini-halos with 
M < 10^Mq) and also numerically interesting as such cal¬ 
culations can be easily accelerated. Full documentation of 
the code is also on the way in order to publicly release EMMA 
on a finite, maybe short, term. 


© 0000 RAS, MNRAS 000, 000-000 


26 


D. Aubert et al. 



Figure Al. The halo mass function of a pure dark matter 
100/i“^Mpc/256^ cosmological run. Dots stand for the mass func¬ 
tions measured in the simulation at z=5.4, 3.3, 1.0 and 0.0 (from 
bottom to top). Lines stand for the |Sheth et al.| ( |2001| > expression 
for the halo mass function. 
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APPENDIX A: ADDITIONAL TESTS ON 
COSMOLOGICAL SIMULATIONS 

Al Mass function in a pure DM simulations 


Eirst, we try to recover the halo mass function in pure dark 
matter cosmological simulations. Initial conditions were pro¬ 
duced with the MPGrafic package ( Prunet et al.|[2008 ) for 
a 100h~^ comoving Mpc box sampled with 256"^ particles. 
The gravitational potential is computed on a 256^ coarse 
grid (i = 8) and refinement up to ^ = 12 is triggered in a 
quasi-Lagrangian manner when a cell contains more than 8 
particles. The spatial resolution is equivalent to a 4096^ grid. 
Cosmological parameters were taken from [Planck Collabo-| 
ration et al. ( 2013| (set ting Qrn = ^c) and used as inputs 


to the Eisenstein Hu] ( |1998| ) transfer function. The halos 


^ http://dhmunro.github.io/yorick-doc/ 



Figure A2. Error on cosmic energy variation as defined in Eq. 
jAlj for three 100 Mpc/h adiabatic DM+gas cosmological simula¬ 
tions with 128^ coarse resolutions {i = 7). From top to bottom 
a simulation without AMR (solid), with ^max = 8 (dashed) and 
^max = 10 (dotted). 


were detected using the HOP halo finder (Eisenstein & Hut 


1998) and their mass function is compared to the formu¬ 


lation of Sheth et al. (2001). Only halos with a number of 
particles greater than 10 particles were kept, corresponding 
to a minimal mass of 5.2 x 10^°M©. 

Eig. |A1| presents the halo mass function obtained at 


different redshifts, directly compared to Sheth et al. (2001). 
A good agreement is obtained at all redshift, with massive 
halos kicking in only at later time as expected. On the low- 
mass end of the mass function, EMMA is complete for halos 
with at least ~ 500 particles and a factor of two below full 
completeness for halos with ~ 100 particles. These numbers 
are standard for such AMR codes and overall we can con¬ 
clude that EMMA tracks correctly the assembly history of dark 
matter halos. 


A2 Energy ’Conservation’ in DM+gas adiabatic 
simulations 

In this section, we probe the energy conservation of an adi¬ 
abatic cosmological run (DM+gas). In supercomoving vari¬ 
ables, the cosmic energy varies between expansion factors ai 
and a 2 according to 


ra2 .J 

(t + u + e;)|S? = / -da 


J ai 


(Al) 


for ay = 5/3 gas. T, U and E stand for the total super¬ 
comoving kinetic energy, potential energy and internal gas 
energy. Three 100 /i“^Mpc 128^ (^ = 7) simulations with 
^max = 7, 8,10 where run, using the same cosmological pa¬ 
rameters and refinement strategy as in Sec. jAlj I n the same 
spirit as Kravtsov et al. ( 1997), we check Eq.jAlj against the 


change in potential energy, i.e.: 


© 0000 RAS, MNRAS 000 , 000-000 
































EMMA 27 


(f+(7+s)i“?-/;;f 
u\:i 


da 


(A2) 


In Fig. |A2[ the error is shown for 3 maximum level of 
refinement : -£max = 7 (i.e. no refinement), £ma^ = 8 and 
^max = 10. The error is found to be under control at 2%, 
1.2% and 0.7% respectively. One can note how the 3 tracks 
diverge as rehnement levels are installed (e.g. a = 0.12 for 
£ = 8 and a = 0.2 for £ = 9) providing greater resolution 
and smaller energy drifts. Overall, these levels of error is 


consistent with e.g. Kravtsov et al. (1997); Teyssier (2002). 


APPENDIX B: PRELIMINARY COMPARISON 
OF EMMA PERFORMANCES ON DIFFERENT 
GPU DEVICES 

We briefly describe the timings obtained by EMMA on K20c 
devices, more recent than the M2090 GPUs available on 
Curie. K20c devices differ by the number and type of cores 
(2496 Kepler cores versus 512 Fermi cores for the M2090), 
core clock (706 MHz Vs 1.3GHz for the M2090), memory fre¬ 
quency (2.6 GHz versus 1.9 GHz for the M2090) and band¬ 
width (208 Gb/s versus 177 Gb/s for the M2090). In terms 
of single precision floating point operations, the theoretical 
peak performance of K20c is a factor of 2 greater than the 
M2090. 

We ran a 4 Mpc/h cosmological simulation on a single 
GPU with full physics over 100 time steps, with the same 
settings as the one chosen in Sec. I^and]^ Fig. |B1 1 compares 
the duration of the time steps obtained from two simulations 
made on these two kind of devices. In both case, the timings 
show the same global evolution with spikes due to outputs 
of data and large jumps due to AMR refinement. K20 per¬ 
formances are marginally better than the M2090 ones, at 
the 10% level, despite their greater computing power. It is 
expected since, EMMA calculations on GPU are already domi¬ 
nated on M2090 devices by gather/scatter and host to device 
transfer operations . Future devices with greater bandwidth 
could improve the situation, but in its current state EMMA 
does not really benefit from the greater computing power of 
more recent hardware. 
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